Description



[0001] The present invention relates to identification or the distinguishing of nucleic acids in samples.
[0002] The relationship between structure and function of macromolecules is of fundamental importance in the understanding of biological systems. These relationships are important to understanding, for example, the functions of

mTanL? 0 ? Ura ' W '* tah ™ icate ^ omer as we ° 

mechanisms of cellular control and metabolic feedback.

[0003] Genetic information is critical in continuation of life processes. Life is substantially informationally based and its genetic content controls the growth and reproduction of the organism and its complements. Polypeptides, which are critical features of all living systems, are encoded by the genetic material of the cell. In particular, the properties of enzymes, functional proteins, and structural proteins are determined by the sequence of amino acids which make them up. As structure and function are integrally related, many biological functions may be explained by elucidating the underlying structural features which provide those functions. For this reason, it has become very important to determine the genetic sequences of nucleotides which encode the enzymes, structural proteins, and other effectors of biological functions. In addition to segments of nucleotides which encode polypeptides, there are many nucleotide sequences which are involved in control and regulation of gene expression.

[0004] The human genome project is directed toward determining the complete sequence of the genome of the human organism. Although such a sequence would not correspond to the sequence of any specific individual, it would provide

individuals. It would also provide mapping information which is very useful for further detailed studies. However, the need for highly rapid, accurate, and inexpensive sequencing technology is nowhere more apparent than in a demanding

[0005] The procedures used today for sequencing include the Sanger dideoxy method, see, e.g., Sanger et al. (1977) Proc. Natl. Acad. Sci. USA 74:5463-5467, or the Maxam and Gilbert method, see, e.g., Maxam et al. (1980) Methods in Enzymology 65:499-559. The Sanger method utilizes enzymatic elongation procedures with chain terminating nucleotides. The Maxam and Gilbert method uses chemical reactions exhibiting specificity of reaction to generate nucleotide specific cleavages. Both methods require a practitioner to perform a large number of complex manual manipulations. These manipulations usually require isolating homogeneous DNA fragments, elaborate and tedious preparing of samples, preparing a separating gel, applying samples to the gel, electrophoresing the samples into this gel, working up the finished gel, and analyzing the results of the procedure.
SUMMARY OF THE INVENTION 

[0006] The present invention provides a method of identifying or distinguishing a target nucleic acid in a sample, comprising:

(a) providing an array of at least 100 different probes bound to a substrate in known locations and at a density of at least 1000 probes per square centimetre;

(b) applying the sample to the substrate to obtain a hybridization pattern of the sample; and

(c) comparing the hybridization pattern with a reference pattern to identify or distinguish the target nucleic acid.

[0007] The reference pattern may be obtained by applying a second nucleic acid to the or another said substrate.

[0008] The reference pattern may comprise a reference database of a plurality of sources of nucleic acid. The comparison of sample and reference patterns permits an identification of the source of the sample.
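
As a hedged illustration of step (c), the comparison of a sample hybridization pattern against a reference database might be sketched as follows. The representation of a pattern as a mapping from probe location to signal intensity, the use of a Pearson correlation as the similarity score, and the names `pearson` and `best_match` are all assumptions made for this sketch; the specification does not prescribe a particular scoring scheme.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat pattern carries no information to correlate
    return cov / math.sqrt(var_a * var_b)


def best_match(sample, references):
    """Return (name, score) of the reference pattern most similar to the sample.

    sample:     dict mapping probe location -> measured intensity
    references: dict mapping source name -> dict of location -> intensity
    """
    locations = sorted(sample)
    sample_vec = [sample[loc] for loc in locations]
    scores = {
        name: pearson(sample_vec, [ref.get(loc, 0.0) for loc in locations])
        for name, ref in references.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical 2 x 2 array: locations are (row, column) pairs.
sample = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.05}
references = {
    "source_A": {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0},
    "source_B": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0},
}
name, score = best_match(sample, references)
print(name, round(score, 3))
```

A real array would have many thousands of locations, but the shape of the comparison stays the same: one vector per pattern, one score per reference source.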

[0009] In preferred embodiments, at least part of the sequence of each of the probes is known.

[0010] In other preferred embodiments, at least some of the probes are oligonucleotides.

[0011] The sample hybridization pattern may be analysed to generate a partial nucleotide sequence for the sample nucleic acid, and the partial nucleotide sequence is compared with a nucleotide sequence of the reference.

[0012] In certain embodiments, the second nucleic acid may be from an individual and the result of comparing the sample and reference hybridization patterns determines whether the test sample is from the individual.

[0013] In other embodiments, the sample may be from an individual and the plurality of sources of nucleic acid are provided by potential relatives of the individual, thereby permitting identification of a genealogy of the individual.
[0014] In further embodiments, the sample may be from an organism and the sources of nucleic acid are provided by a plurality of known individuals, thereby permitting identification of the organism as one of the known individuals.
[0015] In yet further embodiments, the sample is from an abnormal, e.g. tumour, neoplastic, diseased or infected tissue and contains transcripts of the abnormal tissue, and the reference pattern is a reference database having expression patterns for a plurality of known abnormalities of tissue, the comparison of sample and reference patterns permitting an identification of abnormal tissue.

[0016] Other embodiments are provided where the sample may be from a tissue and contains nucleic acid of the tissue, and the reference pattern is a reference database having expression patterns for a plurality of known cells, the comparison of sample and reference patterns permitting an identification of the cellular composition, degree of cellular differentiation, stage of cellular development or metastatic potential of the tissue.

[0017] In yet other embodiments, the sample may be from a microbe and contains nucleic acid of the microbe, and the reference pattern is a reference database, the comparison of sample and reference patterns permitting an identification of the microbe. The microbe may be selected from the group consisting of protozoa, virus and bacteria.
[0018] In all methods of the invention, the array preferably has at least 10^3, more preferably at least 10^4, even more preferably at least 10^5, most preferably at least 10^6 different probes bound to the substrate.

[0019] The probes may be bound to the substrate at a density of at least 10^4, preferably at least 10^5, more preferably at least 10^6 known locations per square centimetre.

[0020] The probes may be more than 15, preferably more than 25, more preferably more than 50 nucleotides in length.
[0021] The present invention provides improved methods useful for verification of known sequences and for fingerprinting polymers.

[0022] By reducing the number of manual manipulations required and automating most of the steps, the speed, 
accuracy, and reliability of these procedures are greatly enhanced. 

[0023] The production of a substrate having a matrix of positionally defined regions with attached reagents exhibiting known recognition specificity can be used for the sequence analysis of a polymer, fingerprinting, mapping, and general screening of specific interactions.

[0024] The automation of the substrate production method and of the scan and analysis steps minimizes the need 
for human intervention. This simplifies the tasks and promotes reproducibility. 

[0025] The method of the invention employs a composition comprising a plurality of positionally distinguishable sequence specific reagents attached to a solid substrate, which reagents are capable of specifically binding to a predetermined subunit sequence of a preselected multi-subunit length having at least three subunits, said reagents representing substantially all possible sequences of said preselected length. In some embodiments, the subunit sequence is a polynucleotide sequence. In other embodiments, the specific reagent is an oligonucleotide of at least about five nucleotides, preferably at least eight nucleotides, more preferably at least 12 nucleotides. Usually the specific reagents are all attached to a single solid substrate, and the reagents comprise at least 3000 different sequences. In other embodiments, the reagents represent at least about 25% of the possible subsequences of said preselected length. Usually, the reagents are localized in regions of the substrate having a density of at least 25 regions per square centimeter, and often the substrate has a surface area of less than about 4 square centimeters.
[0026] By way of example and not limitation, fingerprinting methods of the invention may be used for personal identification, genetic screening, identification of pathological conditions, determination of patterns of specific gene expression, and others.

[0027] The detecting of positions which bind target sequence would typically be through a fluorescent label on the target. Although a fluorescent label is probably most convenient, other sorts of labels, e.g., radioactive, enzyme linked, optically detectable, or spectroscopic labels may be used. Because the oligonucleotide probes are positionally defined, the location of the hybridized duplex can directly translate to the sequences which hybridize. Thus analysis of the positions may provide a collection of subsequences found within the target sequence. These subsequences may be matched with respect to their overlaps so as to assemble an intact target sequence.

[0028] Preferred embodiments will now be described by way of example and with reference to drawings in which: 
[0029] Fig. 1 illustrates a flow chart for sequence, fingerprint, or mapping analysis. 
[0030] Fig. 2 illustrates the proper function of a VLSIPS nucleotide synthesis. 
[0031] Fig. 3 illustrates the proper function of a VLSIPS dinucleotide synthesis.
[0032] Fig. 4 illustrates the process of a VLSIPS trinucleotide synthesis. 

I. Overall Description 

A. general

B. VLSIPS substrates 

C. binary masking 

D. applications 

E. detection methods and apparatus 
F. data analysis

II. Theoretical Analysis 
A. simple n-mer structure; theory

B. complications 



III. Polynucleotide Sequencing 

A. preparation of substrate matrix 

B. labeling target polynucleotide 

C. hybridization conditions 

D. detection; VLSIPS scanning 

E. analysis 

F. substrate reuse 






IV. Fingerprinting 

A. general 

B. preparation of substrate matrix 

C. labeling target nucleotides 

D. hybridization conditions 

E. detection; VLSIPS scanning 

F. analysis 

G. substrate reuse 

H. other polynucleotide aspects 

V. Mapping 



A. general 

B. preparation of substrate matrix 

C. labeling 

D. hybridization/specific interaction 
E. detection

F. analysis 

G. substrate reuse 



VI. Additional Screening 

A. specific interactions 

B. sequence comparisons 

C. categorizations 

D. statistical correlations 

VII. Formation of Substrate 



A. instrumentation 

B. binary masking 

C. synthetic methods 

D. surface immobilization 

VIII. Hybridization/Specific Interaction 

A. general 

B. important parameters 

IX. Detection Methods 






A. labeling techniques 

B. scanning system 

X. Data Analysis 



EP 0 834 575 B1 



A. general 

B. hardware 

C. software 

XI. Substrate Reuse

A. removal of label 

B. storage and preservation 

C. processes to avoid degradation of oligomers 


XII. Integrated Sequencing Strategy 

A. initial mapping strategy 

B. selection of smaller clones 


XIII. Commercial Applications 

A. sequencing 

B. fingerprinting 
C. mapping

I. OVERALL DESCRIPTION 

A. General 


[0033] The present invention relies in part on the ability to synthesize or attach specific recognition reagents at known locations on a substrate, typically a single substrate. In particular, the present invention provides the ability to prepare a substrate having a very high density matrix pattern of positionally defined specific recognition reagents. The reagents are capable of interacting with their specific targets while attached to the substrate, e.g., solid phase interactions, and by appropriate labeling of these targets, the sites of the interactions between the target and the specific reagents may be derived. Because the reagents are positionally defined, the sites of the interactions will define the specificity of each interaction. As a result, a map of the patterns of interactions with specific reagents on the substrate is convertible into information on the specific interactions taking place, e.g., the recognized features. Where the specific reagents recognize a large number of possible features, this system allows the determination of the combination of specific interactions which exist on the target molecule. Where the number of features is sufficiently large, the identical combination, or pattern, of features is sufficiently unlikely that a particular target molecule may often be uniquely defined by its features. In the extreme, the features may actually be the subunit sequence of the target molecule, and a given target sequence may be uniquely defined by its combination of features.

[0034] In particular, the methodology is applicable to sequencing polynucleotides. The specific sequence recognition reagents will typically be oligonucleotide probes which hybridize with specificity to subsequences found on the target sequence. A sufficiently large number of those probes allows the fingerprinting of a target polynucleotide or the relative mapping of a collection of target polynucleotides, as described in greater detail below.

[0035] In the high resolution fingerprinting provided by a saturating collection of probes which include all possible subsequences of a given size, e.g., 10-mers, all the subsequences are collated, the specific overlaps are determined, and the entire sequence can usually be reconstructed.

[0036] Sequence analysis may take the form of complete sequence determination, to the level of the sequence of individual subunits along the entire length of the target sequence. Sequence analysis also may take the form of sequence homology, e.g., less than absolute subunit resolution, where "similarity" in the sequence will be detectable, or the form of selective sequences of homology interspersed at specific or irregular locations.

[0037] In either case, the sequence is determinable at selective resolution or at particular locations. Thus, the hybridization method will be useful as a means for identification, e.g., a "fingerprint", much like a Southern hybridization method is used. It is also useful to map particular target sequences.

B. VLSIPS Substrates 


[0038] The invention is enabled by the development of technology to prepare substrates on which specific reagents 
may be either positionally attached or synthesized. In particular, the very large scale immobilized polymer synthesis 
(VLSIPS) technology allows for the very high density production of an enormous diversity of reagents mapped out in 
Tu 106 Q " °™ ' te T t U ! y a " d automatica,| y synthesized including numbers in excess of about 1<? fo» 1<?
10 106 or even more, and at densities of at least about102 103/cm2 1 of/cm*, lOWanduoto lowlrml™ 
^sapphcat^n discloses methods forsynthesizing polymers on a silicon oroX^sJ^'l^ZS^Z^ 

Z l^ZTllTTT^ 1/(563 ° f bi ° ,09ical P0,ymers on *~ substLes ?scknnt 
^ !. ! raCt ' 0n h3S 0CCUrred 31 Specifi ° ,OCat)ons on the substrate . and various other technolon es 



C. Binary Masking 



2 "[here are varlous Particular ways to optimize the synthetic processes 

[0044] Briefly, the binary synthesis strategy refers to an ordered strateqv for parallel svnthP^k nf rWre^ n k 

ZtanL i , I mary nUmberS from 1 10 n arran 9 ed in colu ™s. In preferred embodiments a binarv 

strategy is one in wh.ch at least two successive steps illuminate half of a region of interest on the sutetrate \1 , 

example a strategy .n whrch a switch matrix for a masking strategy halves regions that were previous 

ha) of previous* protected reg,ons and illuminating about half of previously protected regions) It wi! be reZS 
hatbinary rounds may be interspersed with non-binary mundsandthatonlyaportonof asSatem^ 

l^s^stS™ T nSide h re h d 10 ^ 8 ^ maSkin9 SCheme ^™«1 S 
ZSjSJZh'i SSr 8 US6S " 9ht 10 rem ° Ve Pr0tSCtiVe 9r ° UPS ,r ° m materia ' S * addi «°» 2 
nH^L "! parti f U,ar ' 1 this P^cedure provides a simplified and highly efficient method for saturating all possible se- 

D. Applications 

[0046] The technology provided by the present invention has very broad applications. Although described specif icallv 
too P o^^ 

wCsecond C L*! hydrate ' ° r P ° tymere - ThiS may be f0r de novo s ^^ncing, or may be used in conjSon 
with a second sequencing procedure to provide independent verification. See, e.g., (1988) Science 242^245. For example, a large polynucleotide sequence defined by either the Maxam and Gilbert technique or by the Sanger technique may be verified by using the present invention.

[0047] In addition, by selection of appropriate probes, a polynucleotide sequence can be fingerprinted. Fingerprinting 
is a less detailed sequence analysis which usually involves the characterization of a sequence by a combination of 
defined features. Sequence fingerprinting is particularly useful because the repertoire of possible features which can
be tested is virtually infinite. Moreover, the stringency of matching is also variable depending upon the application. A 
Southern Blot analysis may be characterized as a means of simple fingerprint analysis. 

[0048] Fingerprinting analysis may be performed to the resolution of specific nucleotides, or may be used to determine 
homologies, most commonly for large segments. In particular, an array of oligonucleotide probes of virtually any workable
size may be positionally localized on a matrix and used to probe a sequence for either absolute complementary
matching, or homology to the desired level of stringency using selected hybridization conditions. 
[0049] In addition, the present invention provides means for mapping analysis of a target sequence or sequences. 
Mapping will usually involve the sequential ordering of a plurality of various sequences, or may involve the localization 
of a particular sequence within a plurality of sequences. This may be achieved by immobilizing particular large segments 
onto the matrix and probing with a shorter sequence to determine which of the large sequences contain that smaller
sequence. Alternatively, relatively shorter probes of known or random sequence may be immobilized to the matrix and 
a map of various different target sequences may be determined from overlaps. Principles of such an approach are 
described in some detail by Evans et al. (1989) "Physical Mapping of Complex Genomes by Cosmid Multiplex Analysis,"
Proc. Natl. Acad. Sci. USA 86:5030-5034; Michiels et al. (1987) "Molecular Approaches to Genome Analysis: A
Strategy for the Construction of Ordered Overlap Clone Libraries," CABIOS 3:203-210; Olsen et al. (1986)
"Random-Clone Strategy for Genomic Restriction Mapping in Yeast," Proc. Natl. Acad. Sci. USA 83:7826-7830; Craig, et al.
(1990) "Ordering of Cosmid Clones Covering the Herpes Simplex Virus Type I (HSV-I) Genome: A Test Case, for 
Fingerprinting by Hybridization," Nuc. Acids Res. 18:2653-2660; and Coulson, et al. (1986) "Toward a Physical Map 
of the Genome of the Nematode Caenorhabditis elegans," Proc. Natl. Acad. Sci. USA 83:7821-7825. 
[0050] Fingerprinting analysis also provides a means of identification. In addition to its value in apprehension of
criminals from whom a biological sample, e.g., blood, has been collected, fingerprinting can ensure personal identifi- 
cation for other reasons. For example, it may be useful for identification of bodies in tragedies such as fire, flood, and 
vehicle crashes. In other cases the identification may be useful in identification of persons suffering from amnesia, or 
of missing persons. Other forensics applications include establishing the identity of a person, e.g., military identification
"dog tags", or may be used in identifying the source of particular biological samples. Fingerprinting technology is
described, e.g., in Carrano, et al. (1989) "A High-Resolution, Fluorescence-Based, Semi-automated method for DNA
Fingerprinting," Genomics 4: 129-136. See, e.g., table I, for nucleic acid applications. 



TABLE I

VLSIPS PROJECT IN NUCLEIC ACIDS

I. Construction of Chips

II. Applications

    A. Sequencing
        1. Primary sequencing
        2. Secondary sequencing (sequence checking)
        3. Large scale mapping
        4. Fingerprinting

    B. Duplex/Triplex formation
        1. Antisense
        2. Sequence specific function modulation (e.g. promoter inhibition)

    C. Diagnosis
        1. Genetic markers
        2. Type markers
            a. Blood donors
            b. Tissue transplants

    D. Microbiology
        1. Clinical microbiology
        2. Food microbiology

III. Instrumentation

    A. Chip machines
    B. Detection

IV. Software Development

    A. Instrumentation software
    B. Data reduction software
    C. Sequence analysis software

a large number of genetfc ^Tol^ZlT?' *" Simultaneous genetic screeningTor 

more generally accessible ' P " d,a9n ° 8tlC SCreen ' ng 030 be sim P |ified - economized, and made 

E. Detection Methods and Apparatus 
[0053] An appropriate detection method applicable to the selected lah^inn m*th^ . ^ 

include radionuclides, enzymes substrates rXtnrc mw^ 9 h ° d Can be se,ected ' Suitab,e ,ab *'s 

scopy (AFM), electrical condutance and image plate transS? } " ™ ros «>PK atomic force micro- 

SM *" b ° -*-L Apparatus, as 

iflcationsmayalsobeinc^rate^ 
F. Data Analysis 

EffiSisri 
So^ 

Eioi?^ 

o'agivenpoiymerhas^ 

having dimensions of 500 microns by 500 mfc^^^^^ 

by 5 microns. In most preferred ^bodilT tnl' pninn * „ " 0Ver re 9 ions havin 9 dimensions of 5 microns 

are less than about 1/2 he a^a^e SoS n wShTdZ ^ ^ "* tBk-n acr0SS the substrate 
the area in which a singte polypi ^SZZZTSZ^ TT? SyntheSi2ed - P referab| V than 1/10 

polymer is synthesized HeSceTiuifn nTaZ ^ K P tha " 1/100 the area in which a ^le 

fluorescence data poSs „ JcoileS " " 3 ^ *** been «*"*-»»* a .arge number of 

Ssir^^ 
[0059] Accordingly, in one embodiment of the invention the data are corrected for removal of these spurious data points, and an average of the data points is thereafter utilized in determining relative binding efficiency. In general, the data are fitted to a base curve and statistical measures are used to remove spurious data.
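
The idea of discarding spurious data points before averaging can be illustrated with a short sketch. It does not fit a base curve as the text describes; as a stand-in, it assumes a median and median-absolute-deviation (MAD) rule, and the function name `robust_mean`, the threshold of 3.5, and the sample readings are all hypothetical.

```python
import statistics


def robust_mean(values, threshold=3.5):
    """Average the readings after dropping points far from the median.

    Assumption for this sketch: a point is "spurious" when its modified
    z-score, 0.6745 * |v - median| / MAD, exceeds the threshold.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return med  # at least half the readings are identical; use that value
    kept = [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]
    return statistics.mean(kept)


# Hypothetical fluorescence counts collected within one synthesis region;
# the last reading stands in for a spurious point.
readings = [102, 98, 101, 99, 100, 950]
print(robust_mean(readings))
```

Because many data points are collected inside each region where a single polymer was synthesized, a rule of this kind has enough samples per region to separate an isolated outlier from the bulk of the readings.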
[0060] In an additional analytical tool, various degeneracy reducing analogues may be incorporated in the hybridization probes. Various aspects of this strategy are described, e.g., in Macevicz, S. (1990) PCT publication number WO 90/04652.

II. THEORETICAL ANALYSIS 

[0061] The principle of the hybridization sequencing procedure is based, in part, upon the ability to determine overlaps of short segments. The VLSIPS technology provides the ability to generate reagents which will saturate the possible short subsequence recognition possibilities. The principle is most easily illustrated by using a binary sequence, such as a sequence of zeros and ones. Once having illustrated the application to a binary alphabet, the principle may easily be understood to encompass three letter, four letter, five or more letter, even 20 letter alphabets. A theoretical treatment of the analysis of subsequence information to reconstruct a target sequence is provided, e.g., in Lysov, Yu., et al. (1988) Doklady Akademi. Nauk. SSR 303:1508-1511; Khropko K., et al. (1989) FEBS Letters 256:118-122; Pevzner, P. (1989) J. of Biomolecular Structure and Dynamics 7:63-69; and Drmanac, R. et al. (1989) Genomics 4:114-128.
[0062] The reagents for recognizing the subsequences will usually be specific for recognizing a particular polymer subsequence anywhere within a target polymer. It is preferable that conditions may be devised which allow absolute discrimination between high fidelity matching and very low levels of mismatching. The reagent interaction will preferably exhibit no sensitivity to flanking sequences, to the subsequence position within the target, or to any other remote structure within the sequence.

A. Simple n-mer Structure: Theory 


1. Simple two letter alphabet: example

[0063] A simple example is presented below of how a sequence of ten digits comprising zeros and ones would be sequenceable using short segments of five digits. For example, consider the sample ten digit sequence:
1010011100.

A VLSIPS substrate could be constructed, as discussed elsewhere, which would have reagents attached in a defined matrix pattern which specifically recognize each of the possible five digit sequences of ones and zeros. The number of possible five digit subsequences is 2^5 = 32. The number of possible different sequences 10 digits long is 2^10 = 1,024. The five contiguous digit subsequences within a ten digit sequence number six, i.e., positioned at digits 1-5, 2-6, 3-7, 4-8, 5-9, and 6-10. It will be noted that the specific order of the digits in the sequence is important and that the order is directional, e.g., running left to right versus right to left. The first five digit sequence contained in the target sequence is 10100. The second is 01001, the third is 10011, the fourth is 00111, the fifth is 01110, and the sixth is 11100.
[0064] The VLSIPS substrate would have a matrix pattern of positionally attached reagents which recognize each of the different 5-mer subsequences. Those reagents which recognize each of the 6 contained 5-mers will bind the target, and a label allows the positional determination of where the sequence specific interaction has occurred. By correlation of the position in the matrix pattern, the corresponding bound subsequences can be determined.
[0065] In the above-mentioned sequence, six different 5-mer sequences would be determined to be present. They would be:
10100
01001
10011
00111
01110
11100
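
The enumeration above can be checked with a few lines of code. This is only an illustrative sketch; the function name `contiguous_subsequences` is an assumption introduced here and does not come from the specification.

```python
def contiguous_subsequences(sequence, k):
    """Return every length-k window of the sequence, read left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


target = "1010011100"
five_mers = contiguous_subsequences(target, 5)

print(five_mers)        # the six 5-mers at positions 1-5, 2-6, ..., 6-10
print(2 ** 5, 2 ** 10)  # number of possible 5-digit and 10-digit binary sequences
```

A sequence of length n always has n - k + 1 windows of length k, which is why the ten digit example yields six 5-mers.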

[0066] Any sequence which contains the first five digit sequence, 10100, already narrows the number of possible sequences (e.g., from 1024 possible sequences) which contain it to less than about 192 possible sequences.
[0067] This 192 is derived from the observation that with the subsequence 10100 at the far left of the sequence, in positions 1-5, there are only 32 possible sequences. Likewise, there are 32 possible sequences for that particular subsequence in each of positions 2-6, 3-7, 4-8, 5-9, and 6-10. So, to sum up all of the sequences that could contain 10100, there are 32 for each position and 6 positions for a total of about 192 possible sequences. However, some of these 10 digit sequences will have been counted twice. Thus, by virtue of containing the 10100 subsequence, the number of possible 10-mer sequences has been decreased from 1024 sequences to less than about 192 sequences.
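
The bound of 192 and the effect of double counting can be verified by brute force. The sketch below simply enumerates all 1024 ten digit binary sequences; the helper name `count_containing` is hypothetical.

```python
from itertools import product


def count_containing(subsequence, length, alphabet="01"):
    """Count the sequences of the given length that contain the subsequence."""
    return sum(
        1
        for digits in product(alphabet, repeat=length)
        if subsequence in "".join(digits)
    )


# Upper bound from the text: 6 start positions, 2^5 free digits at each.
upper_bound = (10 - 5 + 1) * 2 ** (10 - 5)
exact = count_containing("10100", 10)

print(upper_bound, exact)
```

For this particular subsequence the exact count is 191, one below the bound, because 10100 cannot overlap itself and the only ten digit sequence containing it twice is 1010010100.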

[0068] In this example, not only do we know that the sequence contains 10100, but we also know that it contains the second 5-mer, 01001. By virtue of knowing that the sequence contains 10100, we can look specifically for a 5-mer which contains the four leftmost digits plus a next digit to the left. For example, we would look for a sequence X1010, but we find that there is none. Thus, we know that the 10100 must be at the left end of the 10-mer. We would also look for a 5-mer which contains the rightmost four digits plus a next digit to the right, i.e., 0100X. We find that the sequence also contains 01001, and that X is a 1. Thus, we know at least that our target sequence has an overlap of 0100 and has the left terminal sequence 101001.

[0069] Applying the same procedure to the second 5-mer, we also know that the sequence must include a sequence of five digits having the sequence 1001Y, where Y must be either 0 or 1. We look through the fragments and we see that we have a 10011 sequence within our target. Thus Y is 1, and we would know that our sequence has a sequence of the first seven digits being 1010011.

[0070] Moving to the next 5-mer, we look for a sequence 0011Z, where Z must be either 0 or 1. We see that the sequence contains a 00111 subsequence and Z is 1. Thus, we know the sequence must start with 10100111.

[0071] The next 5-mer must be 0111W. Among the fragments we find 01110, so our sequence to this point is 101001110. We know that the last 5-mer must be either 11100 or 11101. Looking above, we see that it is 11100, and that must be the last of our sequence. Thus we have determined that the sequence is 1010011100.
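
The step-by-step overlap walk can be expressed as a short routine. This is a minimal sketch under simplifying assumptions that hold for the worked example but not in general: every 5-mer occurs exactly once in the target, there is exactly one leftmost 5-mer, and each extension is unambiguous. The name `reconstruct` is introduced here for illustration.

```python
def reconstruct(kmers):
    """Rebuild a sequence from its set of k-mers by chaining (k-1)-digit overlaps."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))

    # The leftmost k-mer is the one whose first k-1 digits are not the
    # last k-1 digits of any k-mer, i.e. nothing can precede it.
    suffixes = {m[1:] for m in kmers}
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique left end; the simple walk does not apply")

    sequence = starts[0]
    used = {sequence}
    while True:
        tail = sequence[-(k - 1):]
        candidates = [m for m in kmers - used if m[:-1] == tail]
        if not candidates:
            return sequence  # nothing extends the right end: done
        if len(candidates) > 1:
            raise ValueError("ambiguous extension at " + tail)
        sequence += candidates[0][-1]  # append the one new digit
        used.add(candidates[0])


detected = ["10100", "01001", "10011", "00111", "01110", "11100"]
print(reconstruct(detected))
```

The order in which the detected 5-mers are supplied does not matter, which mirrors the physical situation: the array reports which probes bound, not where in the target they bound.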

[0072J However.ftwillberecognLdfrom^ 
anafysiscanstartwtthanyknownpositi^ 
a'°"9 the sequence checking the known s^^ 

sequence may be determined besides ^snn^T T ber ° f n6Xt positions ' Give " this possibility the 
oniy where the" next poS pS Seffrlr^f P ° Sfti0nS ' by »S 
longer time span dedicated towards ; sea™ in fl „T ,ncreasathe «"" P lexity of the scanning but may provide a 
possibiiities. Thus, the sc^Z Z^ZTJl^ ^ * intereSt re,ativ6 10 «her sequence 

oligonucleotide to only iook at £^™ZZZ2™-» a,on S a . -»««• from a given contained 
[0073] It is seen that given a sequence ft » he d™Itn , I ♦ 6XPeCted t0 have a positive si 9"al. 
subsequences. From a^y given ta^J 

hybridization sequence method depends In p a rt uoon be n. It t T & fragmente would M The 
known sequences to the full sequence In simple «Tes 2 Til tTj" * r6VerSS ' ,r ° m 8 Set ° f fra 9 metlts ° f 

zrrnur r n thee ^° f ^ 

Thus 'a ^TZ^oZT: ^fiZ"^ ~ -ry quickiy w,h the length of that sequence, 
andaao-merhasoverab...^^ 

a 30 character target sequence having ove a millon c^sfcJe X? ^T* intemal 5 ' mer Se " uences - Th "* 
different 5-mers. It will be recoonized that th P nroh > P s< Wces can be substantially defined by only 26 
identical length, and thatm^C^ S^ESSS" ** " aed 001 neCessari *- * ° 

es need not differ by only a single sS mo^^ 
^^«ctualVcomaina P luraTftyof p res 0 ?S 

specifications would be preferred a less than f ,n JT * alth0U9h a " of tne P ossib| e subsequence 

a substantia, fraction irSCSCzTSl?! 0 " 8 ?" be M h particu,ar " ««-S 



2. Example of four letter alphabet



Sto^r^rs 

-hoftheovertaps.As^ndway^ 
E,h^ 

4-charactera.phabetwith 10 positions, fl^^^ ^TSlS^S. = 024 P ° SSib ' 9 S6qUenCeS " in a 
of afour character sequence has amuch larqernumbe rJ ^ 0 ,, B ^ ,W,6W ^* te8e< " uenee "-^'».theeornpta(tty 
Note, however, that there are still onll dSe LSl. f lble J 3et " Jences C0 ™P™« to a two character sequence 
with 3 character subsequent TSS^fSt t^ST*^ ? Sha " * 5 Character s,rin 9 

us take the sequence GGCTA. The 3-nw, subl'eque^ ^ ^ * **■ A ' C ' G ' and T - ^ 




GGC 
GCT 
CTA 
Given these subsequences, there is one sequence, or at most only a few sequences which would produce that combination of subsequences, i.e., GGCTA.

[0077] Alternatively, with a four character universe, the binary system can be looked at in pairs of digits. The pairs would be 00, 01, 10, and 11. In this manner, the earlier used sequence 1010011100 is looked at as 10,10,01,11,00. Then the first character of two digits is selected from the possible universe of the four representations 00, 01, 10, and 11. Then a probe would be in an even number of digits, e.g., not five digits, but three pairs of digits or six digits. A similar comparison is performed and the possible overlaps determined. The 3-pair subsequences are:
10,10,01
10,01,11
01,11,00

and the overlap reconstruction produces 10,10,01,11,00.

[0078] The latter of the two conceptual views of the 4 letter alphabet provides a representation which is similar to 
what would be provided in a digital computer. The applicability to a four nucleotide alphabet is easily seen by assigning, 
e.g., 00 to A, 01 to C, 10 to G, and 11 to T. And, in fact, if such a correspondence is used, both examples for the 4 
character sequences can be seen to represent the same target sequence. The applicability of the hybridization method
and its analysis for determining the ultimate sequence is easily seen if A is the representation of adenine, C is the 
representation of cytosine, G is the representation of guanine, and T is the representation of thymine or uracil. 
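The correspondence between the two views of the four letter alphabet can be checked directly. The sketch below uses the assignment given in the text (00 to A, 01 to C, 10 to G, 11 to T); the function names are hypothetical and chosen only for illustration.

```python
# Two-bit code from the text: 00 -> A, 01 -> C, 10 -> G, 11 -> T
CODE = {"00": "A", "01": "C", "10": "G", "11": "T"}
INVERSE = {base: bits for bits, base in CODE.items()}


def bits_to_bases(bits):
    """Read a binary string two digits at a time and map each pair to a base."""
    if len(bits) % 2:
        raise ValueError("binary string must contain an even number of digits")
    return "".join(CODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(seq):
    """Map each base back to its two-digit representation."""
    return "".join(INVERSE[base] for base in seq)


if __name__ == "__main__":
    print(bits_to_bases("1010011100"))
    print(bases_to_bits("GGCTA"))
```

Under this assignment the binary sequence 1010011100 and the nucleotide sequence GGCTA are the same target, which is the equivalence the paragraph describes.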

B. Complications 


[0079] Two obvious complications exist with the method of sequence analysis by hybridization. The first results from 
a probe of inappropriate length while the second relates to internally repeated sequences. 

[0080] The first obvious complication is a problem which arises from an inappropriate length of recognition sequence,
which causes problems with the specificity of recognition. For example, if the recognized sequence is too short, every
sequence which is utilized will be recognized by every probe sequence. This occurs, e.g., in a binary system where
the probes are each of sequences which occur relatively frequently, e.g., a two character probe for the binary system.
Each possible two character probe would be expected to appear 1/4 of the time in every single two character position.
Thus, the above sequence example would be recognized by each of the 00, 10, 01, and 11 probes. Thus, the sequence
information is virtually lost because the resolution is too low and each recognition reagent specifically binds at multiple
sites on the target sequence.
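The loss of resolution with short probes can be illustrated on the binary example sequence 1010011100. The following sketch (the helper name `probes_hit` is an assumption made for illustration) counts how many of the possible probes of a given length find a match in the target.

```python
from itertools import product


def probes_hit(target, k, alphabet="01"):
    """Return (probes of length k that occur in target, total number of probes)."""
    present = {target[i:i + k] for i in range(len(target) - k + 1)}
    all_probes = ["".join(p) for p in product(alphabet, repeat=k)]
    hits = [p for p in all_probes if p in present]
    return hits, len(all_probes)


if __name__ == "__main__":
    for k in (2, 3, 5):
        hits, total = probes_hit("1010011100", k)
        print(f"k={k}: {len(hits)} of {total} probes bind")
```

With two character probes every possible probe binds, so the hybridization pattern carries no information; with five character probes only a small fraction bind, and the pattern becomes informative.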

[0081] The number of different probes which bind to a target depends on the relationship between the probe length
and the target length. At the extreme of short probe length, the just mentioned problem exists of excessive redundancy
and lack of resolution. The lack of stability in recognition will also be a problem with extremely short probes. At the
extreme of long probe length, each entire probe sequence is on a different position of a substrate. However, a problem
arises from the number of possible sequences, which goes up dramatically with the length of the sequence. Also, the
specificity of recognition begins to decrease as the contribution to binding by any particular subunit may become
sufficiently low that the system fails to distinguish the fidelity of recognition. Mismatched hybridization may be a problem
with the polynucleotide sequencing applications, though the fingerprinting and mapping applications may not be so
strict in their fidelity requirements. As indicated above, a thirty position binary sequence has over a million possible
sequences, a number which starts to become unreasonably large in its required number of different sequences, even
though the target length is still very short. Preparing a substrate with all sequence possibilities for a long target may
be extremely difficult due to the many different oligomers which must be synthesized.

[0082] The above example illustrates how a long target sequence may be reconstructed with a reasonably small
number of shorter subsequences. Since the present day resolution of the regions of the substrate having defined
oligomer probes attached to the substrate approaches about 10 microns by 10 microns for resolvable regions, about
10^6, or 1 million, positions can be placed on a one centimeter square substrate. However, high resolution systems may
have particular disadvantages which may be outweighed using the lower density substrate matrix pattern. For this
reason, a sufficiently large number of probe sequences can be utilized so that any given target sequence may be
determined by hybridization to a relatively small number of probes.
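The figure of about one million positions per square centimeter follows from simple arithmetic, sketched below. The function names are hypothetical, and the calculation assumes ideal square regions packed edge to edge with no spacing between them, which a real substrate would not achieve exactly.

```python
def regions_per_cm2(feature_um):
    """Number of square regions of side feature_um that tile one square centimeter."""
    per_side = 10_000 / feature_um  # 1 cm = 10,000 microns
    return per_side ** 2


def area_cm2(n_regions, feature_um):
    """Substrate area needed for n_regions under the same ideal packing."""
    return n_regions / regions_per_cm2(feature_um)


if __name__ == "__main__":
    print(regions_per_cm2(10))          # regions per cm^2 at 10 micron resolution
    print(area_cm2(4 ** 10, 10))        # area for all ten-nucleotide probes
```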

[0083] A second complication relates to convergence of sequences to a single subsequence. This will occur when
a particular subsequence is repeated in the target sequence. This problem can be addressed in at least two different
ways. The first, and simpler way, is to separate the repeat sequences onto two different targets. Thus, each single
target will not have the repeated sequence and can be analyzed to its end. This solution, however, complicates the
analysis by requiring that some means be found for cutting at a site between the repeats. Typically a careful
sequencer would want to have two intermediate cut points so that the intermediate region can also be sequenced in
both directions across each of the cut points. This problem is inherent in the hybridization method for sequencing but
can be minimized by using a longer known probe sequence so that the frequency of probe repeats is decreased.
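The effect of an internal repeat can be made concrete with a small sketch. The two ten-character sequences below are invented for illustration and are not taken from the specification; they differ in the order of their middle segments but contain exactly the same set of 3-mers, because the overlap TG occurs three times. Moving to 4-mer probes separates them.

```python
from collections import Counter


def spectrum(seq, k):
    """Set of all length-k subsequences, i.e. what a complete probe array would report."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def repeated_overlaps(seq, k):
    """(k-1)-mers occurring more than once; these make the overlap chain branch."""
    counts = Counter(seq[i:i + k - 1] for i in range(len(seq) - k + 2))
    return {m for m, n in counts.items() if n > 1}


if __name__ == "__main__":
    s1 = "ATGCTGGTGA"  # hypothetical example sequences
    s2 = "ATGGTGCTGA"
    print(spectrum(s1, 3) == spectrum(s2, 3))  # same 3-mer content
    print(spectrum(s1, 4) == spectrum(s2, 4))  # longer probes tell them apart
    print(repeated_overlaps(s1, 3))
```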
[0084] Knowing the sequence of flanking sequences of the repeat will simplify the use of polymerase chain reaction
for analysis. See, e.g., Innis et al., Academic Press; and, for methods for synthesis of oligonucleotides, Gait (1984)
Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford.



^o'thedegeneracyreducingana^^ 
to fully saturate the possible o^ 
• 4-mer of degenerate mttM^SSSS^ "JElTSS T ' T ° f 1 * m " i h3Ving the 

number of possible 8-mers. e.g. 65 53^ 65 536 131 I7I £ ?h ^ the C ° ,leCtion numbers •«*» the 

possible 12-mers. ' " 131 ,072, bUt the P°P ula t'°n Provides screening equivalent to all 

[0086 N1 SSESSSSST 1 a " P ° SSib,e ° ,i90nUC,e0tlde ^ ^ be depicted ^e fashion: 

^Ss^*^ * « = 3* chemical binary 

nucleotides. oVh^ 

12-mers can be made in the fashion: eorreapondinfl complementary nucleotide, new oligonucleotides 

N1 -N2-N3-N4-D-D-D-D-N5-N6-N7-N8 
in which there are again, as above, on.y 4* 1 65,536 possible "12-mers-, which in reality only have 6 different nuc,e- 

So™^ 

nation of the two sets, i^fcTnuSSSSK 65 536 ° ^g^^."^ u "*MI making 48 = 65,536. The combi- 
16,777,216 molecules. TiuslZTo^n^ tiS^JTJ T mMeS< bUt 9 ' Ving the ,nfomation * 
necessary to get 12-mer mSZ^JZ££S^ *° ° f m ° leCU,es 

III. POLYNUCLEOTIDE SEQUENCING 

gonudeottfe, M .tod « fSJSSS ~ ? . . ^""""""a es * ■* —V *!«*« possible oil. 
A. Preparation of Substrate Matrix

Sin ?r e rr «- * ^« m ay be Pro . 

solid phase or other ^SKXSTi^^ t * ° f ten nUCleotide 0,iao ™* « a 

California. Although a Lg.e o.igoVuSde ca n 'bf iSS^T a * * SSJSST^ ^ 
require a fairly large amount of time and investment. For example, there are 4^10 = 1,048,576 possible ten nucleotide
oligomers. Present technology allows making each and every one of them in a separate purified form though such
might be costly and laborious.

-14chtgtS 

WO90/15070 an'd PCT pubSon no K2f " ^ 8 * ^ h PCT pub,ica,i °" «■ 

position. Use of photosensitive
a matrix pattern. By use of the binary masking strategy, the surface of the substrate can be positioned to generate a
desired pattern of regions, each having a defined sequence oligonucleotide synthesized and immobilized thereto.
[0093] Although the prior art technology can be used to generate the desired repertoire of oligonucleotide probes,
an efficient and cost effective means would be to use the VLSIPS technology described in PCT publication no.
WO90/15070. In this embodiment, the photosensitive reagents involved in the production of such a matrix are described
below.

[0094] The regions for synthesis may be very small, usually less than about 100 µm x 100 µm, more usually less
than about 50 µm x 50 µm. The photolithography technology allows synthetic regions of less than about 10 µm x 10
µm, about 3 µm x 3 µm, or less. The detection also may detect such sized regions, though larger areas are more easily
and reliably measured.

[0095] At a size of about 30 microns by 30 microns, one million regions would take about 11 centimeters square or
a single wafer of about 4 centimeters by 4 centimeters. Thus the present technology provides for making a single matrix
of that size having all one million plus possible oligonucleotides. Region sizes are sufficiently small to correspond to
densities of at least about 5 regions/cm², 20 regions/cm², 50 regions/cm², 100 regions/cm², and greater, including 300
regions/cm², 1000 regions/cm², 3K regions/cm², 10K regions/cm², 30K regions/cm², 100K regions/cm², 300K
regions/cm² or more, even in excess of one million regions/cm².

[0096] Although the pattern of the regions which contain specific sequences is theoretically not important, for practical
reasons certain patterns will be preferred in synthesizing the oligonucleotides. Binary masking algorithms can be
applied to generate the pattern of known oligonucleotide probes. By use of these binary masks, a highly efficient means
is provided for producing the substrate with the desired matrix pattern of different sequences. Although the binary
masking strategy allows for the synthesis of all lengths of polymers, the strategy may be easily modified to provide
only polymers of a given length. This is achieved by omitting steps where a subunit is not attached.

[0097] The strategy for generating a specific pattern may take any of a number of different approaches. However,
the binary masking and binary synthesis approaches provide a maximum of diversity with a minimum number of actual
synthetic steps.

[0098] The length of oligonucleotides used in sequencing applications will be selected on criteria determined to some
extent by the practical limits discussed above. For example, if probes are made as oligonucleotides, there will be 65,536
possible eight nucleotide sequences. If a nine subunit oligonucleotide is selected, there are 262,144 possible
permutations of sequences. If a ten-mer oligonucleotide is selected, there are 1,048,576 possible permutations of sequences.
As the number gets larger, the required number of positionally defined subunits necessary to saturate the possibilities
also increases. With respect to hybridization conditions, the length of the matching necessary to confer stability under
the conditions selected can be compensated for. See, e.g., Kanehisa, M. (1984) Nuc. Acids Res. 12:203-213.
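The probe counts quoted above are powers of four, one factor of four for each added nucleotide. A short sketch (the function name `n_probes` is hypothetical) reproduces them:

```python
def n_probes(length, alphabet_size=4):
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length


if __name__ == "__main__":
    for length in (8, 9, 10):
        print(f"{length}-mers: {n_probes(length):,}")
```

Each additional position multiplies the number of regions needed to saturate the possibilities by four, which is why probe length is bounded in practice by the achievable region density.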
[0099] Although described in detail below rather than here, the VLSIPS technology would
typically use a photosensitive protective group on an oligonucleotide. Sample oligonucleotides are shown in Figure 4.
In particular, the photoprotective group on the nucleotide molecules may be selected from a wide variety of positive
light reactive groups preferably including nitro aromatic compounds such as o-nitrobenzyl derivatives or benzylsulfonyl.
See, e.g., Gait (1984) Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford. In a preferred embodiment,
6-nitro-veratryl oxycarbonyl (NVOC), 2-nitrobenzyl oxycarbonyl (NBOC), or α,α-dimethyl-dimethoxybenzyl oxycarbonyl
(DDZ) is used. Photoremovable protective groups are described in, e.g., Patchornik (1970) J. Amer. Chem. Soc. 92:
6333; and Amit et al. (1974) J. Organic Chem. 39:192.

[0100] A preferred linker is used to attach the oligonucleotide to a silicon matrix. A more detailed description is
provided below. A photosensitive blocked nucleotide may be attached to specific locations of unblocked prior cycles 
of attachments on the substrate and can be successively built up to the correct length oligonucleotide probe. 
[0101] It should be noted that multiple substrates may be simultaneously exposed to a single target sequence where
each substrate is a duplicate of one another or where, in combination, multiple substrates together provide the complete
or desired subset of possible subsequences. This provides the opportunity to overcome a limitation of the density of
positions on a single substrate by using multiple substrates. In the extreme case, each probe might be attached to a
single bead or substrate and the beads sorted by whether there is a binding interaction. Those beads which do bind
might be encoded to indicate the subsequence specificity of reagents attached thereto.

[0102] Then, the target may be bound to the whole collection of beads and those beads that have appropriate specific
reagents on them will bind to target. Then a sorting system may be utilized to sort those beads that actually bind the
target from those that do not. This may be accomplished by presently available cell sorting devices or a similar
apparatus. After the relatively small number of beads which have bound the target have been collected, the encoding scheme
may be read off to determine the specificity of the reagent on the bead. An encoding system may include a magnetic
system, a shape encoding system, a color encoding system, or a combination of any of these, or any other encoding
system. Once again, with the collection of specific interactions that have occurred, the binding may be analyzed for
sequence information, fingerprint information, or mapping information.

[0103] The parameters of polynucleotide sizes of both the probes and target sequences are determined by the
applications. The length of the oligonucleotide probes used will depend in part upon the limitations of the technology:
as probe length grows, the system will reach the point where an increase in number of probes becomes disadvantageous.
However, sequencing requires that the system be able to distinguish, by selection of hybridization conditions, between
binding of absolute fidelity and binding of complementary sequences containing mismatches. On the other
hand, if the fidelity is unnecessary, this discrimination is also unnecessary and a significantly longer probe may be
used. Such longer probes would typically be useful in fingerprinting or mapping applications.
[0104] The length of the probe is selected for a length that it will bind with specificity to possible targets. The
hybridization conditions are also very important in that they will determine how close the homology of complementary binding
will be detected. In fact, a single target may be evaluated at a number of different conditions to determine its spectrum
of specificity for binding particular probes. This may find use in a number of other applications besides polynucleotide
sequencing, fingerprinting or mapping. In a related fashion, different regions with reagents having differing
levels of specificity may allow such a spectrum to be defined using a single incubation. Probes with
specific defined non-matches may also be used. Unnatural nucleotides or nucleotides exhibiting modified specificity of
complementary binding are described in greater detail in Macevicz (1990) PCT pub. No. WO 90/04652, and see the
section on modified nucleotides in the Sigma Chemical Company catalogue.

B. Labeling Target Nucleotide 

[0105] The label used to detect the target sequences will be determined, in part, by the detection methods being
applied. Thus, the labeling method and label used are selected in combination with the actual detecting systems being
used.

[0106] Once a particular label has been selected, appropriate labeling protocols will be applied, as described below 
for specific embodiments. Standard labeling protocols for nucleic acids are described, e.g. in Samtrook etT Kam 

s eTe a a! 

see, e.g., Allen G. (1989) Sequencing of Proteins and Pe ptide Elsevier, New York especially chaoterfi ; 3 r»« 
ste,nand W in ta (1961)Chemistrvof m8 Amin 0 A^w i L a nH a ^e NewYo^Cag^ ISSmISS 
e g in Chapl.n and Kennedy (1986) Carbohydrate Analysis: A Practical Approach I RL Press Oxford Lab^inq of 
r^Thet^ — having ol^S, t 

[0107] In some embodiments, the target need not actually be labeled if a means for detecting where interaction takes
place is available. As described below, for a nucleic acid embodiment, such may be provided by an intercalating dye
which intercalates only into double stranded segments, e.g., where interaction occurs. See, e.g., Sheldon et al.
(1986) U.S. Pat. No. 4,582,789.
[0108] In many uses, the target sequence will be absolutely homogeneous, both with respect to the total sequence

It is preferable that the target sequences of interest not be contaminated with a significant amount of labeled
contaminating sequences. The extent of allowable contamination will depend on the sensitivity of the detection

s;sr of the system - Horno9eneous contamina,ion sequences w '" be *S3 

[0109] Although the target polynucleotide must have a unique sequence, the target molecules need not
have identical ends. In fact, the homogeneous target molecule preparation may be randomly sheared to increase the
numerical number of molecules. Since the total information content remains the same, the shearing results only in a
higher number of distinct sequences which may be labeled and bind to the probe. This fragmentation
fzafion !£ZZ t0 !, pre P aration ° f the ""•"«"« having homogeneous enSs. ThTSgn^STe £j 

Sthet^^ 

of the target may often be preferred before the labeling procedure is performed, thereby producing a large number of
labeling groups associated with each subsequence.

C. Hybridization Conditions 

[0110] The hybridization conditions between probe and target should be selected such that the specific recognition
interaction, i.e., hybridization, of the two molecules is both sufficiently specific and sufficiently stable. See, e.g., Hames
and Higgins (1985) Nucleic Acid Hybridisation: A Practical Approach, IRL Press, Oxford. These conditions will be
dependent both on the specific sequence and often on the guanine and cytosine (GC) content of the complementary
hybrid strands. The conditions may often be selected to be universally equally stable independent of the specific
sequences involved. This typically will make use of a reagent such as an arylammonium buffer. See, Wood et al. (1985)
"Base Composition-independent Hybridization in Tetramethylammonium Chloride: A Method for Oligonucleotide
Screening of Highly Complex Gene Libraries," Proc. Natl. Acad. Sci. USA, 82:1585-1588; and Krupov et al. (1989) "An
Oligonucleotide Hybridization Approach to DNA Sequencing," FEBS Letters, 256:118-122. An arylammonium buffer
tends to minimize differences in hybridization rate and stability due to GC content. By virtue of the fact that sequences
then hybridize with approximately equal affinity and stability, there is relatively little bias in strength or kinetics of binding
for particular sequences. Temperature and salt conditions along with other buffer parameters should be selected such
that the kinetics of renaturation are essentially independent of the specific target subsequence or oligonucleotide
probe involved. In order to ensure this, the hybridization reactions will usually be performed in a single incubation of
all the substrate matrices together, exposed to the identical target probe solution under the same conditions.
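The dependence of hybrid stability on GC content can be illustrated with the Wallace rule, a common rule of thumb for short oligonucleotides in ordinary salt buffers that estimates the melting temperature as 2 °C per A or T plus 4 °C per G or C. The rule is not part of this specification and is only a rough approximation; the function name below is hypothetical.

```python
def wallace_tm(seq):
    """Rough melting temperature estimate (deg C) for a short oligonucleotide.

    Wallace rule: Tm ~ 2*(A+T) + 4*(G+C). An approximation only, brought in
    here for illustration; it is not taken from the specification.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence may contain only A, C, G, T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    for probe in ("AAAAAAAA", "ATGCATGC", "GGGGCCCC"):
        print(probe, wallace_tm(probe))
```

Under this estimate, eight nucleotide probes on the same substrate span a wide range of stabilities depending on composition, which is the bias that composition-independent hybridization conditions are meant to remove.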

[0111] Alternatively, various substrates may be individually treated differently. Different substrates may be produced,
each having reagents which bind to target subsequences with substantially identical stabilities and kinetics of
hybridization. For example, all of the high GC content probes could be synthesized on a single substrate which is treated
accordingly. In this embodiment, the arylammonium buffers could be unnecessary. Each substrate is then treated in a
manner that the collection of substrates show essentially uniform binding and the hybridization data of target binding
to the individual substrate matrix is combined with the data from other substrates to derive the necessary subsequence
binding information. The hybridization conditions will usually be selected to be sufficiently specific that the fidelity of
base matching will be properly discriminated. Of course, control hybridizations should be included to determine the
stringency and kinetics of hybridization.

D. Detection; VLSIPS Scanning

[0112] The next step of the sequencing process by hybridization involves labeling of target polynucleotide molecules.
A quickly and easily detectable signal is preferred. The VLSIPS apparatus is designed to easily detect a fluorescent
label, so fluorescent tagging of the target sequence is preferred. Other suitable labels include heavy metal labels,
magnetic probes, chromogenic labels (e.g., phosphorescent labels, dyes, and fluorophores), spectroscopic labels,
enzyme linked labels, radioactive labels, and labeled binding proteins. Additional labels are described in U.S. Pat. No.
4,366,241.

[0113] The detection methods used to determine where hybridization has taken place will typically depend upon the
label selected above. Thus, for a fluorescent label a fluorescent detection step will typically be used. PCT publication
no. WO90/15070 describes apparatus and mechanisms for scanning a substrate matrix using fluorescence detection,
but a similar apparatus is adaptable for other optically detectable labels.

[0114] The detection method provides a positional localization of the region where hybridization has taken place.
However, the position is correlated with the specific sequence of the probe since the probe has specifically been
attached or synthesized at a defined substrate matrix position. Having collected all of the data indicating the subsequences
present in the target sequence, this data may be aligned by overlap to reconstruct the entire sequence of the target,
as illustrated above.
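Because each probe sits at a defined matrix position, a detected position can be translated back into a probe sequence by a lookup. The sketch below assumes a purely hypothetical layout in which all probes of a given length are placed in alphabetical (A, C, G, T) order, row by row; an actual substrate produced by binary masking would have its own layout, which is not specified here.

```python
BASES = "ACGT"


def probe_at(row, col, length, cols):
    """Probe sequence at (row, col), assuming row-major alphabetical layout."""
    if not 0 <= col < cols:
        raise ValueError("column outside the matrix")
    index = row * cols + col
    if not 0 <= index < 4 ** length:
        raise ValueError("position outside the probe set")
    digits = []
    for _ in range(length):
        index, remainder = divmod(index, 4)  # peel off base-4 digits
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def decode(positions, length, cols):
    """Translate a collection of hybridizing positions into subsequences."""
    return sorted(probe_at(r, c, length, cols) for r, c in positions)


if __name__ == "__main__":
    # Hypothetical scan result on an 8 x 8 matrix of all 64 three-nucleotide probes
    print(decode([(5, 1), (4, 7), (3, 4)], length=3, cols=8))
```

The decoded subsequences are then the input to the overlap alignment that reconstructs the target.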

[0115] It is also possible to dispense with actual labeling if some means for detecting the positions of interaction
between the sequence specific reagent and the target molecule is available. This may take the form of an additional
reagent which can indicate the sites either of interaction, or the sites of lack of interaction, e.g., a negative label. For
the nucleic acid embodiments, locations of double strand interaction may be detected by the incorporation of
intercalating dyes, or other reagents such as antibody or other reagents that recognize helix formation, see, e.g., Sheldon,
et al. (1986) U.S. Pat. No. 4,582,789.

E. Analysis 


[0116] Although the reconstruction can be performed manually as illustrated above, a computer program will typically
be used to perform the overlap analysis. A program may be written and run on any of a large number of different
computer hardware systems. The variety of operating systems and languages useable will be recognized by a computer
software engineer. Various different languages may be used, e.g., BASIC; C; PASCAL; etc. A simple flow chart of data
analysis is illustrated in Figure 1.

F. Substrate Reuse



Thereafter, a second target may actually be applied to the recycled matrix and analyzed as before.



IV. FINGERPRINTING 
A. General 



B. Preparation of Substrate Matrix



[0119] A collection of specific probes may be produced by either of the methods described above in the section on 
[0120] In one embodiment, the individually isolated probes may be attached to the matrix at defined positions.

[0122] In another embodiment, a relatively short specific oligonucleotide is used which serves as a targeting reagent
for positionally directing the sequence recognition reagent.

[0123] ^ ert haseparatesubstrateattachedreagentsareattachedtothetargetingsegmentthetwoarecro S slink e d 
nSTHS ^ "ISS ^ SUbStra,e ' SUi,able Cr0SSlinkin9 ^ « known sete" . DaCS 

(1986

C. Labeling Target Nucleotides

[0124] The labeling procedures used in the sequencing embodiments will also be applicable in the fingerprinting
embodiments. However, since the fingerprinting embodiments often will involve relatively large target molecules and
relatively short oligonucleotide probes, the amount of signal necessary to incorporate into the target sequence may be
less critical than in the sequencing applications. For example, a relatively long target with a relatively small number of
labels per molecule may be easily amplified or detected because of the relatively large target molecule size.

[0125] In various embodiments, it may be desired to cleave the target into smaller segments as in the sequencing
embodiments. The labeling procedures and cleavage techniques described in the sequencing embodiments would
usually also be applicable here.

D. Hybridization Conditions 

[0126] The hybridization conditions used in fingerprinting embodiments will typically be less critical than for the
sequencing embodiments. The reason is that the amount of mismatching which may be useful in providing the
fingerprinting information would typically be far greater than that necessary in sequencing uses. For example, Southern
hybridizations do not typically distinguish between slightly mismatched sequences. Under these circumstances,
important and valuable information may be arrived at with less stringent hybridization conditions while providing valuable
fingerprinting information. However, since the entire substrate is typically exposed to the target molecule at one time,
the binding affinity of the probes should usually be of approximately comparable levels. For this reason, if
oligonucleotide probes are being used, their lengths should be approximately comparable and will be selected to hybridize under
conditions which are common for most of the probes on the substrate. Much as in a Southern hybridization, the target
and oligonucleotide probes are of lengths typically greater than about 25 nucleotides. Under appropriate hybridization
conditions, e.g., typically higher salt and lower temperature, the probes will hybridize irrespective of imperfect
complementarity. In fact, with probes of greater than, e.g., about fifty nucleotides, the difference in stability of different sized
probes will be relatively minor.

[0127] Typically the fingerprinting is merely for probing similarity or homology. Thus, the stringency of hybridization
can usually be decreased to fairly low levels. See, e.g., Wetmur and Davidson (1968) "Kinetics of Renaturation of DNA,"
J. Mol. Biol., 31:349-370; and Kanehisa, M. (1984) Nuc. Acids Res., 12:203-213.


E. Detection; VLSIPS Scanning 

[0128] Detection methods will be selected which are appropriate for the selected label. The scanning device need
not necessarily be digitized or placed into a specific digital database, though such would most likely be done. For
example, the analysis in fingerprinting could be photographic. Where a standardized fingerprint substrate matrix is
used, the pattern of hybridizations may be spatially unique and may be compared photographically. In this manner,
each sample may have a characteristic pattern of interactions and the likelihood of identical patterns will preferably be
of such low frequency that the fingerprint pattern indeed becomes a characteristic pattern virtually as unique as an
individual's fingertip fingerprint. With a standardized substrate, every individual could be, in theory, uniquely identifiable
on the basis of the pattern of hybridizing to the substrate.

[0129] Of course, the VLSIPS scanning apparatus may also be useful to generate a digitized version of the fingerprint
pattern. In this way, the identification pattern can be provided in a linear string of digits. This sequence could also be
used for a standardized identification system providing significant useful medical transferability of specific data. In one
embodiment, the probes used are selected to be of sufficiently high resolution to measure polynucleotides encoding
antigens of the major histocompatibility complex. It might even be possible to provide transplantation matching data
in a linear stream of data. The fingerprinting data may provide a condensed version, or summary, of the linear genetic
data, or any other information data base.
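A digitized fingerprint of the kind described can be sketched as a string of binary digits, one digit per probe position on a standardized substrate. The encoding and the simple matching score below are illustrative assumptions, not a format defined by the specification, and the function names are hypothetical.

```python
def fingerprint(hit_positions, n_positions):
    """Linear string of digits: '1' where hybridization was detected, else '0'."""
    hits = set(hit_positions)
    return "".join("1" if i in hits else "0" for i in range(n_positions))


def similarity(fp_a, fp_b):
    """Fraction of probe positions at which two fingerprints agree."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must come from the same standardized substrate")
    same = sum(a == b for a, b in zip(fp_a, fp_b))
    return same / len(fp_a)


if __name__ == "__main__":
    sample_a = fingerprint({0, 3}, 5)
    sample_b = fingerprint({0, 3, 4}, 5)
    print(sample_a, sample_b, similarity(sample_a, sample_b))
```

Two samples scanned on the same standardized substrate can then be compared digit by digit, without any need to reconstruct a sequence.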

F. Analysis 


[0130] The analysis of the fingerprint will often be much simpler than a total sequence determination. However, there
may be particular types of analysis which will be substantially simplified by a selected group of probes. For example, 
probes which exhibit particular populational heterogeneity may be selected. In this way, analysis may be simplified and 
practical utility enhanced merely by careful selection of the specific probes and a careful matrix layout of those probes. 


G. Substrate Reuse 

[0131] As with the sequencing application, the fingerprinting usages may also take advantage of the reusability of
the substrate.

H. Other Polynucleotide Aspects

X stru r of a partfcu,ar <■» 

toacellortissuetypetothee^^^ 
• [0133] RNaJLToS 

RNA may be labeled, for exampTby S^lffiSSS I? 7 ? ""'I'" 1 Ce " tacdon ° r 3 sample. The 

(e.g.. end-.abeled with T4 p*£S5S!EK aS fJSSf ? ^ ^ ° r " y UBtafl radi ° ,abe,6d RNA 

sequencesmaythenbeexposedtothepor,,^ 

The pattern of positions atwhfch labeled RNA 
' identify, and in some embodiments quanSta^Zl RnI "? 7 C ° mpared t0 3 reference P a « a " to 

as being characteristic of a particular <Seit£e ' ' 0r * 1(16 hybridization pattern itself 

first cell type. Similarly, an identical ViaPsXmltoSSSS 7 k f ^ W Mizat ™ » forthe 
obtained from a second cell type (e human 7 Iff may " e l ° a 'a°eled RNA sample 

cel. type. Labeled RNA JyttZ ZS VZTI* ' I?™" hybridiZaa ° n P3ttem f0r the second 
oligonucleotide substrate, and the rLuEbndLZl°nl T*" 0 " ^ hybridi2ed t0 an identical VLSIPS 
temsestablishedforthefrstandsecondc^ 

population can be identified as 2K£ r JXSS^SI^ I * ^ °' a ce " or ce » 

[0135, Where a positional* discrete X^^i^wrST!" T^ 0 " 
the cognate (complementary) labeled RNA soeciest th e h!hJf t " m °' ar 6XCess overthe °' 

to that VLSIPS .ocus (as measured by labeSZi at mSES^ - "*! -m ° Mm ° f Specmc ^ridization 
cognate RNA species present in the labeled 77i 2" Pr ° V ' de 9 0 uantltativa measurement of the 
oleotidesubstrate can provide lla^ 

or cell population, as well as the relative abundant J 7 v,dual .™ Aa P that are expressed in a particular cell 

-5,^^^ fr ° m **» bi ° PSieS - «P«*-y tumor biopsies 

information regarding eel C o^tf^ 

distinct CigonucleotL Z SZ lp^S^ RNA^t T*l S °™ ° f *• PosLally 
(e.g., c-myc, c-rasH, c -sis, e£.) which are ^^117^7^ 5717*7 ^ end °9 enoua proto-oncogens 
[0137] In addition to diagnostic appl^fcnsTS^ R t?' fr f ns f cnbed at elevated in neoplastic tissues, 
ized to VLSIPS oligonucleotide sub*S 

obtained with RNA from related, non-neoplastic cell ZoT^S 'f 0n P attem ( s ) ^P™* to reference patterns 
terns obtained with RNA from neoplasJc celfs Tm l^lr^T T °l * M ° nS betWeen the hybridization pat- 
may be of diagnostic value and nSJSS? RNA °2Ef S t ! ° btamed RNAfrom "^neoplastic cells 
therapeutic modamie, In fact, tl Thig X^JSZ'S SXST' ,77 ? ^ ^ f ° r n ° Vel 

Strata ^nucleotides will be a, .east « n, 
sequencesof theposftionaTy dilS^S^^^S: I??* * 25 nUCl6 ° tideS in len ^ ^ e 
of sequence data, including but not S tfcS^ 

random orpseudorandomseque!ces7orS t^Tno Ka^^h GenBank ' and ° r ma V "« include 
analysis of RNA expression patterns wi7^^^ 
that reflect predonLntly 

to slightly mismatched sequences ^SS^^L^^T" and/or ^oss-hybridization 

[0139] The ability to oenerate a ZhrffnlL « embodiments may be desirable. 

allowJ for the fJil+ZS^tt TaZtS ""*■ " -~ * Specific 

very powerful in providing the mean fo tesuno T Si' T 8 " nUmber ° f possib,e •"tenwtton.. This is 
By using a fingerprinting method, it may be determined that all members of that species are sufficiently similar in specific
sequences that they can be easily identified as being within a particular group. Thus, newly defined classes may be
resolved by their similarity in fingerprint patterns. Alternatively, a non-member of that group will fail to share those many
identifying characteristics. However, since the technology allows testing of a very large number of specific interactions,
it also provides the ability to more finely distinguish between closely related different cells or samples. This will have
important applications in diagnosing viral, bacterial, and other pathological or nonpathological infections.
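As an illustration of grouping samples by fingerprint similarity, the following minimal sketch treats each fingerprint as the set of matrix positions that gave a positive hybridization signal. The Jaccard similarity measure, the threshold value, and all names are assumptions chosen for illustration; the text does not prescribe a particular scoring method.

```python
def jaccard(a, b):
    """Fraction of positive matrix positions shared by two fingerprints."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(sample, references, threshold=0.8):
    """Return (best matching group or None, similarity score).

    `sample` is a set of positive positions; `references` maps a
    hypothetical group name to its reference set of positive positions.
    A sample scoring below `threshold` against every group is treated
    as a non-member (None).
    """
    best_name, best_score = None, 0.0
    for name, reference in references.items():
        score = jaccard(sample, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


references = {
    "species_A": {1, 2, 3, 4, 5},
    "species_B": {4, 5, 6, 7, 8},
}
print(classify({1, 2, 3, 4}, references))
print(classify({20, 21}, references))
```

Raising the threshold corresponds to the finer distinction between closely related samples described above; lowering it groups samples into broader classes.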
[0140] In particular, cell classification may be defined by any of a number of different properties. For example, a cell
class may be defined by the DNA sequences contained therein. This allows species identification for parasitic or other
infections. For example, the human cell is presumably genetically distinguishable from a monkey cell, but different
human cells will share many genetic markers. At higher resolution, each individual human genome will exhibit unique
sequences that can define it as a single individual.

[0141] Likewise, a developmental stage of a cell type may be definable by its pattern of expression of messenger
RNA. For example, in particular stages of cells, high levels of ribosomal RNA are found whereas relatively low levels
of other types of messenger RNAs may be found. The high resolution distinguishability provided by this fingerprinting
method allows the distinction between cells which have relatively minor differences in their expressed mRNA populations.
Where a pattern is shown to be characteristic of a stage, a stage may be defined by that particular pattern of messenger
RNA expression.

[0142] In another embodiment, a substrate as provided herein may be used for genetic screening. This would allow
for simultaneous screening of thousands of genetic markers. As the density of the matrix is increased, many more
molecules can be simultaneously tested. Genetic screening then becomes a simpler method as the present invention
provides the ability to screen for thousands, tens of thousands, hundreds of thousands, and even millions of different
possible genetic features. However, the number of high correlation genetic markers for conditions numbers only in the
hundreds. Again, the possibility for screening a large number of sequences provides the opportunity for generating the
data which can provide correlation between sequences and specific conditions or susceptibility. The present invention
provides the means to generate extremely valuable correlations useful for the genetic detection of the causative
mutation leading to medical conditions. In still another embodiment, the present invention would be applicable to
distinguishing two individuals having identical genetic compositions. The antibody population within an individual is
dependent on both genetic and historical factors. Each individual experiences a unique exposure to various infectious agents,
and the combined antibody expression is partly determined thereby. Thus, individuals may also be fingerprinted by
their lymphocyte DNA or RNA hybridization pattern(s). Similar sorts of immunological and environmental histories may
be useful for fingerprinting, perhaps in combination with other screening properties.

[0143] With the definition of new classes of cells, a cell sorter will be used to purify them. Moreover, new markers
for defining that class of cells will be identified. For example, where the class is defined by its RNA content, cells may
be screened by antisense probes which detect the presence or absence of specific sequences therein. Alternatively,
cell lysates may provide information useful in correlating intracellular properties with extracellular markers which
indicate functional differences. Using standard cell sorter technology with a fluorescent or otherwise labeled antisense probe which
recognizes the internal presence of the specific sequences of interest, the cell sorter will be able to isolate a relatively
homogeneous population of cells possessing the particular marker. Using successive probes, the sorting process should
be able to select for cells having a combination of a large number of different markers.

[0144] One difficulty with the fingerprinting method as an identification means arises from mosaicism in an organism. A
mosaic organism is one whose genetic content in different cells is significantly different. Various clonal populations
should have similar genetic fingerprints, though different clonal populations may have different genetic contents. See,
for example, Suzuki et al. An Introduction to Genetic Analysis (4th Ed.), Freeman and Co., New York. However, this
should be a relatively rare problem and could be more carefully evaluated with greater experience using the
fingerprinting methods.

[0145] The invention will also find use in detecting changes, both genetic and in protein expression (i.e., by RNA
expression fingerprinting), in a rapidly "evolving" protozoan infection, or similarly changing organism.

V. MAPPING 


A. General 

[0146] The use of the present invention for mapping parallels its use for fingerprinting and sequencing. Mapping
provides the ability to locate particular segments along the length of the polynucleotide. The mapping provides the
ability to locate, in a relative sense, the order of various subsequences. This may be achieved using at least two different
approaches.

[0147] The first approach is to take the large sequence and fragment it at specific points. The fragments are then 
ordered and attached to a solid substrate. For example, the clones resulting from a chromosome walking process may 
description of this mapping procedure is contained in, e.g., Evans et al. (1989) "Physical Mapping



B. Preparation of Substrate Matrix 



[0149] The substrate may be generated in either of the methods generally applicable in the seouencino and fin„»r 
pr.nt.ng embodiments. The substrate may be made either synthetically, or by attaching otherTe S^be^ 

SZZIZ Pr ° beS orse « uences ™» * derived eitherfrom ^nttSTZ^nSST^ 

indicated above, the solid phase substrate synthetic methods may be utilized to generate a matrix with positionally

to be much longer. The processes for making a substrate which has longer oligonucleotide probes should not be

C. Labeling 

D. Hybridization/Specific Interaction 

hybridization. Usually, the hybridization conditions will be such that merely homologous segments will interact and
provide a posits signal. Much like the fingerprinting embodiment. ^be^no^Jt^Z^oZ 
^successive incubations at higher stringency conditions. Or, a plurality of different probes, each having various S 
of homology may be used. In either way, the spectrum of homologies can be measured. 

E. Detection 

The detection methods used in the mapping procedure will be virtually identical to those used in the
fingerprinting embodiment. The detection methods will be selected in combination with the labeling methods.

F. Analysis 

LtlTnTK 9 T r *" 6XiStenCe ° f 3n interaCtion fe coup,ed with some °' the locatlo oHSe 

interaction. The interaction is mapped in some manner to the physical polymer sequence. Some means for determining

or m~°i° nS ° Pr ° beS iS Perf0rmed - ™* may be achi6Ved b * s * nthesis ° f *e substra eTpTem 

rn«^ r " alyS,S ° f se " uences a «er they have been attached to the substrate 

Se ™!!!«on!T!h PTObeS be rand ° mly P ° Siti0ned at Vanous locations on the substrate. However, the 
relative positions of the various reagents in the original polymer may be determined by using short fragments, e.g.,

probes are adjacent to one another on the original target sequence and correlate that with positions on the matrix. In this
way, the matrix is useful for determining the relative locations of various new segments in the original target molecule.
This sort of analysis is described in Evans, and the related references described above.
[0155] In another form of mapping, as described above in the fingerprinting section, the developmental map of a cell
or biological system may be measured using fingerprinting type technology. Thus, the mapping may be along a temporal
dimension rather than along a polymer dimension. The mapping or fingerprinting embodiments may also be used in
determining the genetic rearrangements which may be genetically important, as in lymphocyte and B-cell development.
In another example, various rearrangements or chromosomal dislocations may be tested by either the fingerprinting
or mapping methods. These techniques are similar in many respects and the fingerprinting and mapping embodiments
may overlap in many respects.

G. Substrate Reuse 


[0156] The substrate should be reusable in the manner described in the fingerprinting section. The substrate is 
renewed by removal of the specific interactions and is washed and prepared for successive cycles of exposure to new 
target sequences. 

VI. ADDITIONAL SCREENING AND APPLICATIONS

A. Specific Interactions 

[0157] As originally indicated in the parent filing of VLSIPS, the production of a high density plurality of spatially
segregated polymers provides the ability to generate a very large universe or repertoire of individual and distinct
sequence possibilities. As indicated above, particular oligonucleotides may be synthesized in automated fashion at
specific locations on a matrix. In fact, these oligonucleotides may be used to direct other molecules to specific locations
by linking specific oligonucleotides to other reagents which are in batch exposed to the matrix and hybridized in a
complementary fashion to only those locations where the complementary oligonucleotide has been synthesized on the
matrix. This allows for spatially attaching a plurality of different reagents onto the matrix instead of individually attaching
each separate reagent at each specific location. Although the caged biotin method allows automated attachment,
the speed of the caged biotin attachment process is relatively slow and requires a separate reaction for each reagent
being attached. By use of the oligonucleotide method, the specificity of position can be achieved in an automated and
parallel fashion. As each reagent is produced, instead of directly attaching each reagent at each desired position, the
reagent may be attached to a specific desired complementary oligonucleotide which will ultimately be specifically
directed toward locations on the matrix having a complementary oligonucleotide attached thereat.
[0158] In addition, the technology allows screening for specificity of interaction with particular reagents. For example,
the oligonucleotide sequence specificity of binding of a potential reagent may be tested by presenting to the reagent
all of the possible subsequences available for binding. Although secondary or higher order sequence specific features
might not be easily screenable using this technology, it does provide a convenient, simple, quick, and thorough screen
of interactions between a reagent and its target recognition sequences. See, e.g., Pfeifer et al. (1989) Science 246:
810-812.
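To give a sense of the scale involved in presenting all possible subsequences of a given length, the following minimal sketch enumerates every oligonucleotide of length k over the four natural bases. The function name and the choice of k are illustrative assumptions; the count grows as 4 to the power k.

```python
from itertools import product


def all_oligos(k, alphabet="ACGT"):
    """Return every sequence of length k over the given alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=k)]


# The number of distinct matrix positions needed for a complete set
# of k-mers is 4**k.
for k in (2, 4, 8):
    print(k, len(all_oligos(k)))
```

A complete set of 8-mers therefore requires 65,536 distinct positions, which is why the achievable site density of the substrate determines how long a complete probe set can be.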

[0159] For example, the interaction of a promoter protein with its target binding sequence may be tested for many
different, or all, possible binding sequences. By testing the strength of interactions under various different conditions,
the interaction of the promoter protein with each of the different potential binding sites may be analyzed. The spectrum
of strength of interactions with each different potential binding site may provide significant insight into the types of
features which are important in determining specificity.

[0160] An additional example of a sequence specific interaction between reagents is the testing of binding of a double
stranded nucleic acid structure with a single stranded oligonucleotide. Often, a triple stranded structure is produced
which has significant aspects of sequence specificity. Testing of such interactions with either sequences comprising
only natural nucleotides, or perhaps the testing of nucleotide analogs, may be very important in screening for particularly
useful diagnostic or therapeutic reagents. See, e.g., Haner and Dervan (1990) Biochemistry 29:9761-9765, and
references therein.

B. Sequence Comparisons

[0161] Once a gene is sequenced, the present invention provides means to compare alleles or related sequences
to locate and identify differences from the control sequence. This would be extremely useful in further analysis of
genetic variability at a specific gene locus.


C. Categorizations 

[0162] As indicated above in the fingerprinting and mapping embodiments, the present invention is also useful to 
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particular messenger RNAs. The screening pSduSS e lotion "T"^ UP ° n the ex P ressi °" - 
cells. In addition, the temporal development of SSl SiS ? •"T^" de ' initi ° n ° f " ew c,asses ° f 
vanous mRNAs. Means to simultaneously so reen a pTuraS or^l Pr6Sence or 6x P ress ™ °< 

The combination of different markers made ava^le 96 ° f SeqUences as Prided, 

lated cell types. Other markers may be aSSSiS^^^T^T' " dl ' Stin 9 uish fairi V *«V re- 

^rt^ «--^es; ssss? avaiiabie herein to define 

may also be used in defining cell classes and soZ S ,! , may feSUlt in s P eci,ic anti 9^ which 

be possfcle to select a class of omnipotent -V, for example, it should 

immunesystem. Based upon the cellular classes Mr^^^^ l ° "Tt** a human 

classes of cells navjng RMeZsLnTdS DM f T ,ableb y ^hnology. purified 

[0164] in an alternative embodiment subclasses of StmCtUre are made available - 

cel. surface RNA species. The I^E£^tZ!Z^ *° n ,he C ° mbinati0n of ex P ressad 

RNA species together. Thus, higher ^luSn^toSSS, ^m^T^^" 9 °' 3 P luralit y of Cerent 
the definitions and functional JL££££S^ ttsa 1 SUbClaSSeS beC ° mes possib,e and . •» 
types becomes available. This is applicabfc not oZTZ^l ZltTT^ ^ t0 ^ those ce « 
Many of the cells for which this would be most useful w Jbe immobt2 e T °! ' 0r * free,y circulatin 9 «**■ 
cells will be diagnosed or detected using m3naeil^h5 *' Md Part ' CUlar tissues or °<9*ns. Tumor 

the ability not only to define new classes of Site based udo unlnT ? eS " 7,16 P resent 'Mention also provides 
the ability to select orpurify PO P u^nToT<^^ s ^"fS °\ " but tt ate ° P'° vi ** 

orRNAmoleculesmaybeintroducedintoac^to^ 

detect RNA sequences therein. See, e.g., Weintraub (1990) Scientific American 262:40-46.



D. Statistical Correlations
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coisXa^ 

genetic screening methods, typically screen foS oS 

taneous screening for tens, hundreds, thousand ten fl^l^Tl ^ inVention a,,ows si ™<" 

different genetic sequences. Thus, applS th ° USandS - a " d SVen mi,lions ° f 
population allows detailed statistical EtetobTrnTTi metnods ° f the P^ent invention to a sufficiently large 
ular markers, typicaily g e „ et i c ^0^ 

correlation become much more easily performed WW iZ^Z ?? ?l 9 T ^ 96netic P re dictab«lity and 
is better tested. Particular markers wniS ^partial d iaanosno a f na 2 , ° f the P redictions «*» 

bilities will be identified and provide directton E 2 Zi 6s °1 T COnditions «' medical suscepti- 

course, as indicated above in the sequendn ^emSment 7 ^ ° f ,he marKers evolved. Of 

ing projects. For example, sequences ofTe enSre S^SSI f ' nd mUCh USe in in,ense seaue "°- 

plified and enabled by the present invention 9 " ,he hUma " gen0me P"*« wi « "a greatly sim- 



VII. FORMATION OF SUBSTRATE



oahTsub^rr :rc!s::sr r 9e , nts H which are posftiona,iy - *• «*- 

strument will typical* be one San ^mat descnbed" in prr m " ^ Whi ° h pr ° duces the substra ^ The in- 
» scribed therein is dinactly appSte t the apSns usedL'r T T WO90/15070 - T " a instrumentation de- 
typically a silicon containing substrate on S oositio^ *" COm P rises a subs ^te, 
way, masks may be used to photo-activate positions for attachment or synthesis of specific sequences on the substrate.
These manipulations may be automated by the types of apparatus described in PCT publication no. WO90/15070.
[0169] Selectively removable protecting groups allow creation of well defined areas of substrate surface having
differing reactivities. Preferably, the protecting groups are selectively removed from the surface by applying a specific
activator, such as electromagnetic radiation of a specific wavelength and intensity. More preferably, the specific activator
exposes selected areas of surface to remove the protecting groups in the exposed areas.

[0170] Protecting groups of the present invention are used in conjunction with solid phase oligonucleotide syntheses
using deoxyribonucleic and ribonucleic acids. In addition to protecting the substrate surface from unwanted reaction,
the protecting groups block a reactive end of the monomer to prevent self-polymerization.
[0171] Attachment of a protecting group to the 5'-hydroxyl group of a nucleoside during synthesis using, for example,
phosphate-triester coupling chemistry, prevents the 5'-hydroxyl of one nucleoside from reacting with the 3'-activated
phosphate-triester of another.

[0172] Regardless of the specific use, protecting groups are employed to protect a moiety on a molecule from reacting
with another reagent. Protecting groups of the present invention have the following characteristics: they prevent
selected reagents from modifying the group to which they are attached; they are stable (that is, they remain attached) to
the synthesis reaction conditions; they are removable under conditions that do not adversely affect the remaining
structure; and once removed, they do not react appreciably with the surface or surface-bound oligonucleotide.
[0173] In a preferred embodiment, the protecting groups will be photoactivatable. The properties and uses of
photoreactive protecting compounds have been reviewed. See McCray et al., Ann. Rev. of Biophys. and Biophys. Chem.
(1989) 18:239-270. Preferably, the photosensitive protecting groups will be removable by radiation in the ultraviolet
(UV) or visible portion of the electromagnetic spectrum. More preferably, the protecting groups will be removable by
radiation in the near UV or visible portion of the spectrum. In some embodiments, however, activation may be performed
by other methods such as localized heating, electron beam lithography, laser pumping, oxidation or reduction with
microelectrodes, and the like. Sulfonyl compounds are suitable reactive groups for electron beam lithography. Oxidative
or reductive removal is accomplished by exposure of the protecting group to an electric current source, preferably
using microelectrodes directed to the predefined regions of the surface which are desired for activation.
[0174] The density of reagents attached to a silicon substrate may be varied by standard procedures. The surface
area for attachment of reagents may be increased by modifying the silicon surface. For example, a matte surface may
be machined or etched on the substrate to provide more sites for attachment of the particular reagents. Another way
to increase the density of reagent binding sites is to increase the derivatization density of the silicon. Standard
procedures for achieving this are described below.

[0175] One method to control the derivatization density is to derivatize the substrate with photochemical groups
at high density. The substrate is then photolyzed for various predetermined times, which photoactivates the groups at
a measurable rate, and the activated groups are then reacted with a capping reagent. By this method, the density of
linker groups may be modulated by using a desired time and intensity of photoactivation.
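The relation between exposure time and remaining linker density can be sketched numerically. The example below assumes simple first-order photolysis kinetics, in which the fraction of groups still uncapped after exposure decays exponentially; this kinetic model, the rate constant, and the function names are assumptions made for illustration and are not stated in the text.

```python
import math


def remaining_linker_density(initial_density, rate_per_s, exposure_s):
    """Density of groups left uncapped after photolysis and capping.

    Assumes first-order kinetics: groups are photoactivated (and then
    capped) at a rate proportional to the number still present.
    """
    return initial_density * math.exp(-rate_per_s * exposure_s)


def exposure_for_fraction(target_fraction, rate_per_s):
    """Exposure time (s) that leaves `target_fraction` of groups uncapped."""
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    return -math.log(target_fraction) / rate_per_s


# Hypothetical rate constant; in practice it would be measured and
# would depend on the light intensity used.
k = 0.01  # per second
t = exposure_for_fraction(0.25, k)
print(round(t, 1), remaining_linker_density(100.0, k, t))
```

Under this assumption, halving the desired density always costs the same additional exposure time, which is one way the "desired time and intensity" could be chosen in advance.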

[0176] In many applications, the number of different sequences which may be provided may be limited by the density
and the size of the substrate on which the matrix pattern is generated. In situations where the density is insufficiently
high to allow the screening of the desired number of sequences, multiple substrates may be used to increase the
number of sequences tested. Thus, the number of sequences tested may be increased by using a plurality of different
substrates. Because the VLSIPS apparatus is almost fully automated, increasing the number of substrates does not
lead to a significant increase in the number of manipulations which must be performed by humans. This again leads
to greater reproducibility and speed in the handling of these multiple substrates.
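The trade-off between site density, substrate size, and the number of substrates is simple arithmetic, sketched below. The density and area figures are hypothetical values chosen only to show the calculation.

```python
def substrates_needed(n_sequences, sites_per_cm2, area_cm2):
    """Smallest number of substrates that can hold n_sequences sites.

    All arguments are integers; one site holds one distinct sequence.
    """
    capacity = sites_per_cm2 * area_cm2
    if capacity <= 0:
        raise ValueError("substrate capacity must be positive")
    # Ceiling division without floating point.
    return -(-n_sequences // capacity)


# Hypothetical example: all 8-mers (4**8 = 65,536 sequences) on
# substrates holding 1,000 sites per square centimeter over 10 cm2.
print(substrates_needed(4 ** 8, 1000, 10))
```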

A. Instrumentation 


[0177] The concept of using VLSIPS generally allows a pattern or a matrix of reagents to be generated. The procedure
for making the pattern is performed by any of a number of different methods. An apparatus and instrumentation useful
for generating a high density VLSIPS substrate is described in detail in PCT publication no. WO90/15070.

B. Binary Masking

[0178] For example, the binary masking technique allows for producing a plurality of sequences based on the
selection of either of two possibilities at any particular location. By a series of binary masking steps, the binary decision
may be the determination, on a particular synthetic cycle, whether or not to add any particular one of the possible
subunits. By treating various regions of the matrix pattern in parallel, the binary masking strategy provides the ability
to carry out spatially addressable parallel synthesis.
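The combinatorial effect of a series of binary masking steps can be sketched as follows. Each region of the matrix is represented by a tuple of on/off decisions, one per synthetic cycle, and the product in that region is the subsequence of subunits added in the cycles where the region was exposed. The representation and names are illustrative assumptions, and the sketch ignores chain orientation and coupling chemistry.

```python
from itertools import product


def binary_masking_products(cycle_subunits):
    """Map each on/off exposure pattern to the sequence it produces.

    `cycle_subunits` lists the subunit offered in each synthetic cycle.
    A region exposed in a cycle (1) receives that subunit; an unexposed
    region (0) is left unchanged in that cycle.
    """
    products = {}
    for pattern in product((0, 1), repeat=len(cycle_subunits)):
        products[pattern] = "".join(
            subunit
            for subunit, exposed in zip(cycle_subunits, pattern)
            if exposed
        )
    return products


result = binary_masking_products(["A", "C", "G"])
print(len(result))
print(result[(1, 0, 1)])
```

With n cycles the number of distinct exposure patterns is 2 to the power n, so a modest number of masking steps addresses a large number of regions in parallel.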
C. Synthetic Methods






[0179] The construction of the matrix pattern on the substrate will typically be generated by the use of photosensitive
reagents. By use of photolithographic optical methods, particular segments of the substrate can be exposed to
l,ghttoact«ateordeact^ 

T SUfe StSPS 3t aPPr0priate tim6S With appr °' 3ria,e masks and with a'ppro'priaTe the 
substrates can have known polymers synthesized at positionally defined regions on the substrate. Methods for
synthesizing various substrates are described in PCT publication no. WO90/15070. By a sequential series of these
photo-exposure and reaction manipulations, a defined matrix pattern of known sequences may be generated, and is typically
referred to as a VLSIPS substrate. In the nucleic acid synthesis embodiment, nucleosides used in the synthesis
by photolytic methods will typically be one of the two forms shown below:
B = Adenine, Cytosine, Guanine, or Thymine
[0180] In I, the photolabile group at the 5' position is abbreviated NV (nitroveratryl) and in II, the group is abbreviated
NVOC (nitroveratryl oxycarbonyl). Although not shown above, bases (adenine, cytosine, and guanine) contain exocyclic
NH2 groups which must be protected during DNA synthesis. Thymine contains no exocyclic NH2 and therefore requires
no protection. The standard protecting groups for these amines are shown below:



Adenine (A)   Cytosine (C)   Guanine (G)



[0181] Other amides of the general formula

[Structure not reproduced; R = alkyl, aryl]

where R may be alkyl or aryl have been used.

[0182] Another type of protecting group, FMOC (9-fluorenyl methoxycarbonyl), is currently being used to protect the
exocyclic amines of the three bases:

Adenine (A)   Cytosine (C)   Guanine (G)
[0183] The advantage of the FMOC group is that it is removed under mild conditions (dilute organic bases) and can
be used for all three bases. The amide protecting groups require more harsh conditions to be removed (NH3/MeOH
with heat).

[0184] Nucleosides used as 5'-OH probes, useful in verifying correct VLSIPS synthetic function, have been the fol- 
lowing: 




[0185] These compounds are used to detect where on a substrate photolysis has occurred by the attachment of
either III or V to the newly generated 5'-OH. In the case of III, after the phosphate attachment is made, the substrate
is treated with a dilute base to remove the FMOC group. The resulting amine can be reacted with FITC and the substrate
examined by fluorescence microscopy. This indicates the proper generation of a 5'-OH. In the case of compound IV,
after the phosphate attachment is made, the substrate is treated with FITC labeled streptavidin and the substrate again
may be examined by fluorescence microscopy. Other probes, although not nucleoside based, have included the
following:
[0186] The method of attachment of the first nucleoside to the surface of the substrate depends on the functionality
of the groups at the substrate surface. If the surface is amine functionalized, an amide bond is made (see example
below).


[0187] If the surface is hydroxy functionalized, a phosphate bond is made (see example below).
[0188] In both cases, the thymidine example is illustrated, but any one of the four phosphoramidite activated
nucleosides can be used in the first step.

[0189] Photolysis of the photolabile group NV or NVOC on the 5' positions of the nucleosides is carried out at ~362
nm with an intensity of 14 mW/cm² for 10 minutes with the substrate side (side containing the photolabile group)
immersed in dioxane. After the coupling of the next nucleoside is complete, the photolysis is repeated followed by
another coupling until the desired oligomer is obtained.
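The photolysis conditions above imply a fixed light dose per deprotection step, which the following sketch works out. The dose calculation uses only the intensity and time quoted in the text; the helper for total photolysis time assumes, for illustration, a fixed number of photolysis steps of equal duration, which is an assumption rather than a stated protocol.

```python
def exposure_dose_j_per_cm2(intensity_mw_per_cm2, minutes):
    """Light dose in J/cm2 for a given intensity (mW/cm2) and duration."""
    seconds = minutes * 60.0
    return intensity_mw_per_cm2 * seconds / 1000.0


def total_photolysis_minutes(n_photolysis_steps, minutes_per_step=10.0):
    """Total illumination time, assuming equal-length photolysis steps."""
    return n_photolysis_steps * minutes_per_step


# 14 mW/cm2 for 10 minutes, as quoted above.
print(exposure_dose_j_per_cm2(14.0, 10.0))
# Hypothetical synthesis with 8 photolysis steps.
print(total_photolysis_minutes(8))
```

At the quoted conditions each photolysis step delivers about 8.4 J/cm², so the illumination time, rather than the coupling chemistry, may dominate the duration of a long synthesis.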

[0190] One of the most common 3'-O-protecting groups is the ester, in particular the acetate.

[Structure not reproduced; R = CH3, C6H5]



[0191] The groups can be removed by mild base treatment: 0.1 N NaOH/MeOH or K2CO3/H2O/MeOH.
[0192] Another group used most often is the silyl ether. 
R1, R2, R3 = CH3
R1, R2, R3 = iPr

[0193] These groups can be removed under neutral conditions using 1 M tetra-n-butylammonium fluoride in THF or
under acid conditions.

[0194] Related to photodeprotection, the nitroveratryl group could also be used to protect the 3'-O-position.
[0195] Here, light (photolysis) would be used to remove these protecting groups.
[0196] A variety of ethers can also be used in the protection of the 3'-O-position.
[0197] Removal of these groups usually involves acid or catalytic methods.

[0198] Although the specificity of interactions at particular locations will usually be homogeneous due to a
homogeneous polymer being synthesized at each defined location, for certain purposes it may be useful to have mixed
polymers with a mixed collection of interactions occurring at specific defined locations, or degeneracy
reducing analogues which have been discussed above and show broad specificity in binding. Then, a positive interaction
signal may result from any of a number of sequences contained therein.

[0199] As an alternative method of generating a matrix pattern on a substrate, preformed polymers may be
individually attached at particular sites on the substrate. This may be performed by individually attaching reagents one at a
time to specific positions on the matrix, a process which may be automated. Another way of generating a positionally
defined matrix pattern on a substrate is to have individually specific reagents which interact with each specific position
on the substrate. For example, oligonucleotides may be synthesized at defined locations on the substrate. Then the
substrate would have on its surface a plurality of regions having homogeneous oligonucleotides attached at each

[0200] In particular, at least four different substrate preparation procedures are available for treating a substrate:
the standard VLSIPS method, polymeric substrates, Durapore™, and synthetic beads or fibers. The
treatment labeled "standard VLSIPS" method involves applying aminopropyltriethoxysilane to a glass surface.
[0201] The polymeric substrate approach involves either of two ways of generating a polymeric substrate. The first
uses a high concentration of aminopropyltriethoxysilane (2-20%) in an aqueous ethanol solution (95%). This allows
the silane compound to polymerize both in solution and on the substrate surface, which provides a high density of
amines on the surface of the glass. This density is contrasted with the standard VLSIPS method. This polymeric method
allows for the deposition on the substrate surface of a monolayer due to the anhydrous method used with the
aforementioned silane.

[0202] The second polymeric method involves either the coating or covalent binding of an appropriate acrylic acid
polymer onto the substrate surface. In particular, e.g., in DNA synthesis, a monomer such as a hydroxypropylacrylate
is used to generate a high density of hydroxyl groups on the substrate surface, allowing for the formation of phosphate
bonds. An example of such a compound is shown:



[0203] The method using a Durapore™ membrane (Millipore) consists of a polyvinylidene difluoride coating with
crosslinked polyhydroxylpropyl acrylate [PVDF-HPA]:



Here the building up of, e.g., a DNA oligomer, can be started immediately since phosphate bonds to the surface can
be accomplished in the first step with no need for modification. A nucleotide dimer (5'-C-T-3') has been successfully
made on this substrate in our labs.

[0204] The fourth method utilizes synthetic beads or fibers. This would use another substrate, such as a teflon
copolymer graft bead or fiber, which is covalently coated with an organic layer (hydrophilic) terminating in hydroxyl sites
(commercially available from Molecular Biosystems, Inc.). This would offer the same advantage as the Durapore™
membrane, allowing for immediate phosphate linkages, but would give additional contour by the 3-dimensional growth
of oligomers.

[0205] A matrix pattern of new reagents may be targeted to each specific oligonucleotide position by attaching a
complementary oligonucleotide to which the substrate bound form is complementary. For instance, a number of regions
may have homogeneous oligonucleotides synthesized at various locations. Oligonucleotide sequences complementary
to each of these can be individually generated and linked to particular specific reagents. Often these specific reagents
will be antibodies. As each of these is specific for finding its complementary oligonucleotide, each of the specific
reagents will bind through the oligonucleotide to the appropriate matrix position. A single step having a combination of
different specific reagents, each attached specifically to a particular oligonucleotide, will thereby bind each to its complement
at the defined matrix position. The oligonucleotides will typically then be covalently attached, using, e.g., an acridine
dye, for photocrosslinking. Psoralen is a commonly used acridine dye for photocrosslinking purposes, see, e.g., Song
et al. (1979) Photochem. Photobiol. 29:1177-1197; Cimino et al. (1985) Ann. Rev. Biochem. 54:1151-1193; Parsons
(1980) Photochem. Photobiol. 32:813-821; and Dattagupta et al. (1985) U.S. Pat. No. 4,542,102, and (1987) U.S. Pat.
No. 4,713,326. This method allows a single attachment manipulation to attach all of the specific reagents to the matrix
at defined positions and results in the specific reagents being homogeneously located at defined positions.
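The addressing scheme described in this paragraph, in which each reagent carries an oligonucleotide tag complementary to the oligonucleotide at its intended matrix position, can be sketched as a lookup by reverse complement. The sequences, reagent names, and the requirement of an exact full-length match are illustrative assumptions; real hybridization tolerates or rejects mismatches depending on stringency.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def address_reagents(matrix, tagged_reagents):
    """Assign each tagged reagent to the matrix position it hybridizes to.

    `matrix` maps a position to the oligonucleotide synthesized there;
    `tagged_reagents` maps a reagent name to the oligonucleotide tag
    linked to it. Reagents whose tag has no complement on the matrix
    are left unplaced.
    """
    position_by_oligo = {oligo: pos for pos, oligo in matrix.items()}
    placement = {}
    for reagent, tag in tagged_reagents.items():
        position = position_by_oligo.get(reverse_complement(tag))
        if position is not None:
            placement[position] = reagent
    return placement


matrix = {(0, 0): "ACGTAC", (0, 1): "TTGACA"}
tagged = {
    "antibody_1": "GTACGT",
    "antibody_2": "TGTCAA",
    "antibody_3": "AAAAAA",
}
print(address_reagents(matrix, tagged))
```

Because every reagent finds its own position in the same batch exposure, the whole set is placed in one manipulation, which is the advantage the paragraph describes.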
D. Surface Immobilization



1. Caged biotin



[0206] An alternative method of attaching reagents in a positionally defined pattern is to use a caged biotin
system. See [...] for additional details on the chemistry and application of caged biotin embodiments. In short, the
caged biotin has a photosensitive blocking moiety which prevents the combination of avidin to biotin. At positions
where the photolithographic process has removed the blocking group, high affinity biotin sites are generated. Thus,
by a sequential series of photolithographic deblocking steps interspersed with exposure of those regions to the
appropriate reagents, only those locations where the deblocking takes place will form an avidin-biotin interaction.
Because the avidin-biotin binding is very tight, this will usually be virtually irreversible binding.



2. crosslinked interactions 



[0207] Immobilization may also take place by photocrosslinking of defined oligonucleotides linked to
specific reagents. After hybridization of the complementary oligonucleotides, the oligonucleotides may be crosslinked
by a reagent such as psoralen or another similar type of acridine dye. Other useful crosslinking reagents are described in
Dattagupta et al. (1985) U.S. Pat. No. 4,542,102, and (1987) U.S. Pat. No. 4,713,326.

[0208] In another embodiment, colonies or phage plaques containing biological polymers may be transferred directly
onto a substrate. For example, a colony plate may be transferred onto a substrate having a generic oligonucleotide
sequence which hybridizes to another generic complementary sequence contained on all of the vectors into which
inserts are cloned. This will specifically bind only those molecules which are actually contained in the vectors containing
the complementary sequence. This immobilization allows for producing a matrix onto which a sequence specific
reagent can bind, or for other purposes. In a further embodiment, a plurality of different vectors each having a specific
oligonucleotide attached to the vector may be specifically attached to particular regions on a matrix having a
complementary oligonucleotide attached thereto.

VIII. HYBRIDIZATION/SPECIFIC INTERACTION 
A. General 

[0209] As discussed previously in the VLSIPS parent applications, the VLSIPS substrates may be used for screening
for specific interactions with sequence specific targets or probes.

[0210] In addition, the availability of substrates having the entire repertoire of possible sequences of a defined length
opens up the possibility of sequencing by hybridization. This sequencing may be de novo determination of an unknown
sequence, particularly of nucleic acid, verification of a sequence determined by another method, or an investigation of
changes in a previously sequenced gene, locating and identifying specific changes. For example, often Maxam and
Gilbert sequencing techniques are applied to sequences which have been determined by Sanger and Coulson. Each
of those sequencing technologies has problems with resolving particular types of sequences. Sequencing by
hybridization may serve as a third independent method for verifying other sequencing techniques. See, e.g., [...] (1988) [...].

[0211] In addition, the ability to provide a large repertoire of particular sequences allows use of short subsequences
and hybridization as a means to fingerprint a polynucleotide sample. For example, fingerprinting to a high degree of
specificity of sequence matching may be used for identifying highly similar samples, e.g., those exhibiting high
homology to the selected probes. This may provide a means for determining classifications of particular sequences. This
should allow determination of whether particular genomes of bacteria, phage, or even higher cells might be related to
one another.
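
The fingerprint comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the fingerprint is modeled as the set of short subsequences (k-mers) present in a sample, and similarity is scored with a Jaccard index. Neither the scoring choice nor the function names `kmer_set` and `fingerprint_similarity` come from this document.

```python
def kmer_set(seq, k):
    """Return the set of all length-k subsequences present in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fingerprint_similarity(seq_a, seq_b, k=4):
    """Jaccard similarity of the k-mer fingerprints of two sequences.

    Hypothetical scoring: 1.0 means identical probe patterns,
    0.0 means no probe is shared.
    """
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 1.0  # two empty fingerprints are trivially identical
    return len(a & b) / len(a | b)
```

In an actual experiment the two sets would be read from the positions that give a positive hybridization signal rather than computed from known sequences.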

[0212] In addition, fingerprinting may be used to identify an individual source of biological sample. See, e.g., Lander,
E. (1989) Nature 339:501-505, and references therein. For example, a DNA fingerprint may be used to determine
whether a genetic sample arose from another individual. This would be particularly useful in various sorts of forensic
tests to determine, e.g., paternity or sources of blood samples. Significant detail on the particulars of genetic
fingerprinting for identification purposes is described in, e.g., Morris et al. (1989) "Biostatistical evaluation of evidence from
continuous allele frequency distribution DNA probes in reference to disputed paternity and identity," J. Forensic Science
34:1311-1317; and Neufeld et al. (1990) Scientific American 262:46-53.

[0213] In another embodiment, a fingerprinting-like procedure may be used for classifying cell types by analyzing a
pattern of specific nucleic acids present in the cell, specifically RNA expression patterns. This may also be useful in
defining the temporal stage of development of cells, e.g., stem cells or other cells which undergo temporal changes in
development. For example, the stage of a cell, or group of cells, may be tested or defined by isolating a sample of
mRNA from the population and testing to see what sequences are present in messenger populations. Direct samples,
or amplified samples (e.g., by polymerase chain reaction), may be used. Where particular mRNA or other nucleic acid 
sequences may be characteristic of or shown to be characteristic of particular developmental stages, physiological 
states, or other conditions, this fingerprinting method may define them. 
[0214] The present invention may also be used for mapping sequences within a larger segment. This may be
performed by at least two methods, particularly in reference to nucleic acids. Often, enormous segments of DNA are
subcloned into a large plurality of subsequences. Ordering these subsequences may be important in determining the
overlaps of sequences upon nucleotide determinations. Mapping may be performed by immobilizing particularly large
segments onto a matrix using the VLSIPS technology. Alternatively, sequences may be ordered by virtue of
subsequences shared by overlapping segments. See, e.g., Craig et al. (1990) Nuc. Acids Res. 18:2653-2660; Michiels et
al. (1987) CABIOS 3:203-210; and Olson et al. (1986) Proc. Natl. Acad. Sci. USA 83:7826-7830.

B. Important Parameters 

[0215] The extent of specific interaction between reagents immobilized to the VLSIPS substrate and another
sequence specific reagent may be modified by the conditions of the interaction. Sequencing embodiments typically require
high fidelity hybridization and the ability to discriminate perfect matching from imperfect matching. Fingerprinting and 
mapping embodiments may be performed using less stringent conditions, or in some embodiments very highly stringent 
conditions, depending upon the circumstances. 

[0216] In a nucleic acid hybridization embodiment, the specificity and kinetics of hybridization have been described
in detail by, e.g., Wetmur and Davidson (1968) J. Mol. Biol. 31:349-370, Britten and Kohne (1968) Science
161:529-540, and Kanehisa (1984) Nuc. Acids Res. 12:203-213. Parameters which are well known to affect specificity and
kinetics of reaction include salt conditions, ionic composition of the solvent, hybridization temperature, length of
oligonucleotide matching sequences, guanine and cytosine (GC) content, presence of hybridization accelerators, pH,
specific bases found in the matching sequences, solvent conditions, and addition of organic solvents.

[0217] In particular, the salt conditions required for driving highly mismatched sequences to completion typically
include a high salt concentration. The typical salt used is sodium chloride (NaCl); however, other ionic salts may be
utilized, e.g., KCl. Depending on the desired stringency of hybridization, the salt concentration will often be less than
about 3 molar, more often less than 2.5 molar, usually less than about 2 molar, and more usually less than about 1.5
molar. For applications directed towards higher stringency matching, the salt concentrations would typically be lower.
Ordinary high stringency conditions will utilize a salt concentration of less than about 1 molar, more often less than about
750 millimolar, usually less than about 500 millimolar, and may be as low as about 250 or 150 millimolar.
[0218] The kinetics of hybridization and the stringency of hybridization both depend upon the temperature at which
the hybridization is performed and the temperature at which the washing steps are performed. Temperatures at which
steps for low stringency hybridization are desired would typically be lower temperatures, e.g., ordinarily at least about
15°C, more ordinarily at least about 20°C, usually at least about 25°C, and more usually at least about 30°C. For those
applications requiring high stringency hybridization, or fidelity of hybridization and sequence matching, temperatures
at which hybridization and washing steps are performed would typically be high. For example, temperatures in excess
of about 35°C would often be used, more often in excess of about 40°C, usually at least about 45°C, and occasionally
even temperatures as high as about 50°C or 60°C or more. Of course, the hybridization of oligonucleotides may be
disrupted by even higher temperatures. Thus, for stripping of targets from substrates, as discussed below, temperatures
as high as 80°C, or even higher, may be used.

[0219] The base composition of the specific oligonucleotides involved in hybridization affects the temperature of
melting, and the stability of hybridization as discussed in the above references. However, the bias of GC rich sequences
to hybridize faster and retain stability at higher temperatures can be compensated for by the inclusion in the hybridization
incubation or wash steps of various buffers. Sample buffers which accomplish this result include the triethyl- and
trimethylammonium buffers. See, e.g., Wood et al. (1987) Proc. Natl. Acad. Sci. USA 82:1585-1588, and Khrapko, K. et
al. (1989) FEBS Letters 256:118-122.
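
To make the GC bias concrete, the sketch below estimates melting temperature from base composition. It uses the Wallace rule of thumb, Tm ≈ 2·(A+T) + 4·(G+C) °C, which is a common approximation for short oligonucleotides and is not taken from this document. The function names are likewise illustrative, and the rule ignores salt concentration and the compensating buffers mentioned above.

```python
def gc_fraction(oligo):
    """Fraction of G and C bases in an oligonucleotide."""
    oligo = oligo.upper()
    if not oligo:
        raise ValueError("empty oligonucleotide")
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo):
    """Rough melting temperature in degrees C for a short oligonucleotide.

    Assumption: Wallace rule, 2 degrees per A/T base and 4 degrees per
    G/C base. This is an approximation, not a measured value.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc
```

Under this approximation two 8-mers on the same substrate can differ by a factor of two in estimated melting temperature, which is the spread the ammonium buffers are meant to reduce.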

[0220] The rate of hybridization can also be affected by the inclusion of particular hybridization accelerators. These
hybridization accelerators include the volume exclusion agents characterized by dextran sulfate, or polyethylene glycol
(PEG). Dextran sulfate is typically included at a concentration of between 1% and 40% by weight. The actual
concentration selected depends upon the application, but typically a faster hybridization is desired in which the concentration
is optimized for the system in question. Dextran sulfate is often included at a concentration of between 0.5% and 2%
by weight or dextran sulfate at a concentration between about 0.5% and 5%. Alternatively, proteins which accelerate
hybridization may be added, e.g., the recA protein found in E. coli, or other homologous proteins.

[0221] Of course, the specific hybridization conditions will be selected to correspond to a discriminatory condition
which provides a positive signal where desired but fails to show a positive signal at affinities where interaction is not
desired. This may be determined by a number of titration steps or with a number of controls which will be run during
the hybridization and washing steps to determine at what point the hybridization conditions have reached the desired specificity.

IX. DETECTION METHODS 

[0222] Methods for detection depend upon the label selected. The criteria for selecting an appropriate label are
discussed below; however, a fluorescent label is preferred because of its extreme sensitivity and simplicity. Standard
labeling procedures are used to determine the positions where interactions between a sequence and a reagent take
place. For example, if a target sequence is labeled and exposed to a matrix of different probes, only those positions
where probes do interact with the target will exhibit any signal. Alternately, other methods may be used to scan the
matrix to determine where interaction takes place. Of course, the spectrum of interactions may be determined in a
temporal manner by repeated scans of interactions which occur at each of a multiplicity of conditions. However, instead [...]

A. Labeling Techniques 

[0223] The target polynucleotide may be labeled by any of a number of convenient detectable markers. A fluorescent
label is preferred because it provides a very strong signal with low background. It is also optically detectable at high
resolution and sensitivity through a quick scanning procedure. Other potential labeling moieties include radioisotopes
and [...].

[0224] Another method for labeling does not require incorporation of a labeling moiety. The target may be exposed
to the probes and a double strand hybrid is formed at those positions only. Addition of a double strand specific reagent
will detect where hybridization takes place. An intercalative dye such as ethidium bromide may be used as long as the
probes themselves do not fold back on themselves to a significant extent forming hairpin loops. See, e.g., Sheldon et
al., U.S. Pat. No. 4,582,789. However, the length of the hairpin loops in short oligonucleotide probes would
typically be insufficient to form a stable duplex.

[0225] In another embodiment, different targets may be simultaneously sequenced where each target has a different
label. For instance, one target could have a green fluorescent label and a second target could have a red fluorescent
label. The scanning step will distinguish sites of binding of the red label from those binding the green fluorescent label.
Each sequence can be analyzed independently from one another.

[0226] Suitable chromogens will include molecules and compounds which absorb light in a distinctive range of
wavelengths so that a color may be observed, or emit light when irradiated with radiation of a particular wavelength or
wavelength range, e.g., fluoresceins.

[0227] A wide variety of suitable dyes are available, being primarily chosen to provide an intense color with minimal
absorption by their surroundings. Illustrative dye types include quinoline dyes, triarylmethane dyes, acridine dyes,
alizarine dyes, phthaleins, insect dyes, azo dyes, anthraquinoid dyes, cyanine dyes, phenazathionium dyes, and
phenazoxonium dyes.

[0228] A wide variety of fluorescers may be employed either by themselves or in conjunction with quencher
molecules. Fluorescers of interest fall into a variety of categories having certain primary functionalities. These primary
functionalities include 1- and 2-aminonaphthalene, p,p'-diaminostilbenes, pyrenes, quaternary phenanthridine salts,
9-aminoacridines, p,p'-diaminobenzophenone imines, anthracenes, oxacarbocyanine, merocyanine, 3-aminoequilenin,
perylene, bis-benzoxazole, bis-p-oxazolyl benzene, 1,2-benzophenazin, retinol, bis-3-aminopyridinium salts,
hellebrigenin, tetracycline, sterophenol, benzimidzaolylphenylamine, 2-oxo-3-chromen, indole, xanthen,
7-hydroxycoumarin, phenoxazine, salicylate, strophanthidin, porphyrins, triarylmethanes and flavin. Individual fluorescent
compounds which have functionalities for linking or which can be modified to incorporate such functionalities include, e.g.,
dansyl chloride; fluoresceins such as 3,6-dihydroxy-9-phenylxanthhydrol; rhodamineisothiocyanate;
N-phenyl 1-amino-8-sulfonatonaphthalene; N-phenyl 2-amino-6-sulfonatonaphthalene;
4-acetamido-4-isothiocyanato-stilbene-2,2'-disulfonic acid; pyrene-3-sulfonic acid; 2-toluidinonaphthalene-6-sulfonate;
N-phenyl, N-methyl 2-aminoaphthalene-6-sulfonate; ethidium bromide; stebrine; auromine-0,2-(9'-anthroyl)palmitate;
dansyl phosphatidylethanolamine; N,N'-dioctadecyl oxacarbocyanine; N,N'-dihexyl oxacarbocyanine; merocyanine,
4-(3'-pyrenyl)butyrate; d-3-aminodesoxy-equilenin; 12-(9'-anthroyl)stearate; 2-methylanthracene; 9-vinylanthracene;
2,2'-(vinylene-p-phenylene)bisbenzoxazole; p-bis[2-(4-methyl-5-phenyl-oxazolyl)]benzene;
6-dimethylamino-1,2-benzophenazin; retinol; bis(3'-aminopyridinium) 1,10-decandiyl diiodide; sulfonaphthylhydrazone of
hellibrienin; chlorotetracycline; N-(7-dimethylamino-4-methyl-2-oxo-3-chromenyl)maleimide;
N-[p-(2-benzimidazolyl)-phenyl]maleimide; N-(4-fluoranthyl)maleimide; bis(homovanillic acid); resazarin;
4-chloro-7-nitro-2,1,3-benzooxadiazole; merocyanine 540; resorufin; rose bengal; and 2,4-diphenyl-3(2H)-furanone.

[0229] Desirably, fluorescers should absorb light above about 300 nm, preferably about 350 nm, and more preferably
above about 400 nm, usually emitting at wavelengths greater than about 10 nm higher than the wavelength of the light
absorbed. It should be noted that the absorption and emission characteristics of the bound dye may differ from the
unbound dye. Therefore, when referring to the various wavelength ranges and characteristics of the dyes, it is intended
to indicate the dyes as employed and not the dye which is unconjugated and characterized in an arbitrary solvent.
[0230] Fluorescers are generally preferred because by irradiating a fluorescer with light, one can obtain a plurality 
of emissions. Thus, a single label can provide for a plurality of measurable events. 

[0231] Detectable signal may also be provided by chemiluminescent and bioluminescent sources. Chemiluminescent
sources include a compound which becomes electronically excited by a chemical reaction and may then emit light
which serves as the detectable signal or donates energy to a fluorescent acceptor. A diverse number of families of
compounds have been found to provide chemiluminescence under a variety of conditions. One family of compounds
is 2,3-dihydro-1,4-phthalazinedione. The most popular compound is luminol, which is the 5-amino compound. Other
members of the family include the 5-amino-6,7,8-trimethoxy- and the dimethylamino[ca]benz analog. These
compounds can be made to luminesce with alkaline hydrogen peroxide or calcium hypochlorite and base. Another family
of compounds is the 2,4,5-triphenylimidazoles, with lophine as the common name for the parent product.
Chemiluminescent analogs include para-dimethylamino and -methoxy substituents. Chemiluminescence may also be obtained
with oxalates, usually oxalyl active esters, e.g., p-nitrophenyl and a peroxide, e.g., hydrogen peroxide, under basic
conditions. Alternatively, luciferins may be used in conjunction with luciferase or lucigenins to provide bioluminescence.

[0232] Spin labels are provided by reporter molecules with an unpaired electron spin which can be detected by
electron spin resonance (ESR) spectroscopy. Exemplary spin labels include organic free radicals, transition metal
complexes, particularly vanadium, copper, iron, and manganese, and the like. Exemplary spin labels include nitroxide
free radicals.

B. Scanning System 


[0233] With the automated detection apparatus, the correlation of specific positional labeling is converted to the 
presence on the target of sequences for which the reagents have specificity of interaction. Thus, the positional infor- 
mation is directly converted to a database indicating what sequence interactions have occurred. For example, in a 
nucleic acid hybridization application, the sequences which have interacted between the substrate matrix and the target
molecule can be directly listed from the positional information. The detection system used is described in PCT
publication no. WO90/15070. Although the detection described therein is a fluorescence detector, the detector may be
replaced by a spectroscopic or other detector. The scanning system may make use of a moving detector relative to a 
fixed substrate, a fixed detector with a moving substrate, or a combination. Alternatively, mirrors or other apparatus 
can be used to transfer the signal directly to the detector. 

[0234] The detection method will typically also incorporate some signal processing to determine whether the signal
at a particular matrix position is a true positive or may be a spurious signal. For example, a signal from a region which
has actual positive signal may tend to spread over and provide a positive signal in an adjacent region which actually
should not have one. This may occur, e.g., where the scanning system is not properly discriminating with sufficiently
high resolution in its pixel density to separate the two regions. Thus, the signal over the spatial region may be evaluated
pixel by pixel to determine the locations and the actual extent of positive signal. A true positive signal should, in theory,
show a uniform signal at each pixel location. Thus, processing by plotting number of pixels with actual signal intensity
should have a clearly uniform signal intensity. Regions where the signal intensities show a fairly wide dispersion may
be particularly suspect and the scanning system may be programmed to more carefully scan those positions.
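
The pixel-by-pixel check described above might look like the following sketch. It is an illustration only: the coefficient of variation as the measure of dispersion, the thresholds `min_ratio` and `max_cv`, and the function name `classify_region` are assumptions made here, not details from this document.

```python
from statistics import mean, pstdev


def classify_region(pixels, background, min_ratio=3.0, max_cv=0.25):
    """Classify one matrix region from its pixel intensities.

    Returns "negative" when the mean signal is near background,
    "positive" when the signal is strong and uniform, and "rescan"
    when the signal is strong but widely dispersed across pixels.
    Thresholds are hypothetical and would be tuned per instrument.
    """
    if background <= 0:
        raise ValueError("background must be positive")
    avg = mean(pixels)
    if avg < min_ratio * background:
        return "negative"
    cv = pstdev(pixels) / avg  # dispersion relative to the mean
    return "positive" if cv <= max_cv else "rescan"
```

A region flagged "rescan" corresponds to the suspect positions that the scanning system would examine more carefully.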
[0235] In another embodiment, as the sequence of a target is determined at a particular location, the overlap for the
sequence would necessarily have a known sequence. Thus, the system can compare the possibilities for the next
adjacent position and look at these in comparison with each other. Typically, only one of the possible adjacent
sequences should give a positive signal and the system might be programmed to compare each of these possibilities and select
that one which gives a strong positive. In this way, the system can also simultaneously provide some means of
measuring the reliability of the determination by indicating what the average signal to background ratio actually is.
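
A minimal sketch of this comparison follows, assuming the scan results are available as a mapping from probe sequence to measured intensity. The name `extend_by_one`, the dictionary layout, and the use of a single background value are hypothetical choices for illustration.

```python
def extend_by_one(current, signals, background, bases="ACGT"):
    """Pick the probe that most plausibly follows `current`.

    The last k-1 positions of `current` are the known overlap; the four
    candidates differ only in the final base. Returns the candidate with
    the strongest signal and its signal-to-background ratio, which can
    serve as a rough reliability measure.
    """
    if background <= 0:
        raise ValueError("background must be positive")
    overlap = current[1:]
    candidates = {overlap + b: signals.get(overlap + b, 0.0) for b in bases}
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / background
```

For example, with `signals = {"CGTA": 90.0, "CGTC": 8.0, "CGTG": 5.0, "CGTT": 7.0}` and a background of 10.0, extending `"ACGT"` selects `"CGTA"` with a ratio of 9.0. A low ratio, or two candidates of similar strength, would indicate an unreliable call.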

[0236] More sophisticated signal processing techniques can be applied to the initial determination of whether a
positive signal exists or not.

[0237] From a listing of those sequences which interact, data analysis may be performed on a series of sequences.
For example, in a nucleic acid sequence application, each of the sequences may be analyzed for their overlap regions
and the original target sequence may be reconstructed from the collection of specific subsequences obtained therein.
Other sorts of analyses for different applications may also be performed, and because the scanning system directly
interfaces with a computer the information need not be transferred manually. This provides for the ability to handle
large amounts of data with very little human intervention. This, of course, provides significant advantages over manual
manipulations. Increased throughput and reproducibility are thereby provided by the automation of the vast majority of steps
in any of these applications.



X. DATA ANALYSIS



A. General



[0238] Data analysis will typically involve aligning the proper sequences with their overlaps to determine the target
sequence. Although the target "sequence" may not specifically correspond to any specific molecule, especially where
the target sequence is broken and fragmented up in the sequencing process, the sequence corresponds to a contiguous
sequence of the subfragments.

[0239] The data analysis can be performed by a computer using an appropriate program. See, e.g., Drmanac, R. et
al. (1989) Genomics 4:114-128; and a commercially available analysis program available from the Genetic Engineering
Center, P.O. Box 794, 11000 Belgrade, Yugoslavia. Although the specific manipulations necessary to reassemble the
target sequence from fragments may take many forms, one embodiment uses a sorting program to sort all of the
subsequences using a defined hierarchy. The hierarchy need not necessarily correspond to any physical hierarchy,
but provides a means to determine, in order, which subfragments have actually been found in the target sequence. In
this manner, overlaps can be checked and found directly rather than having to search throughout the entire set after
each selection process. For example, where the oligonucleotide probes are 10-mers, the first 9 positions can be sorted.
A particular subsequence can be selected, as in the examples, to determine where the process starts. As analogous
to the theoretical example provided above, the sorting procedure provides the ability to immediately find the position
of the subsequence which contains the first 9 positions and can compare whether there exists more than 1 subsequence
during the first 9 positions. In fact, the computer can easily generate all of the possible target sequences which contain
a given combination of subsequences. Typically there will be only one, but in various situations, there will be more.
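
The sorting-based reassembly described above can be sketched as follows. This is one possible reading of the procedure, not the program cited: it assumes each subsequence occurs exactly once in the target, that the starting subsequence is known, and that hybridization data are error-free. The names `successors` and `reconstruct` are hypothetical.

```python
from bisect import bisect_left


def successors(sorted_kmers, kmer):
    """Probes whose first k-1 positions equal the last k-1 of `kmer`.

    Because the list is sorted, the matches form one contiguous run
    that a binary search locates directly.
    """
    prefix = kmer[1:]
    i = bisect_left(sorted_kmers, prefix)
    found = []
    while i < len(sorted_kmers) and sorted_kmers[i].startswith(prefix):
        found.append(sorted_kmers[i])
        i += 1
    return found


def reconstruct(kmers, start):
    """Return every sequence that starts with `start` and uses each
    detected subsequence exactly once (simplifying assumption)."""
    sorted_kmers = sorted(set(kmers))
    if start not in sorted_kmers:
        raise ValueError("start must be one of the detected subsequences")
    results = []

    def walk(path, used):
        if len(used) == len(sorted_kmers):
            # Merge overlaps: first probe plus the last base of each later one.
            results.append(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for nxt in successors(sorted_kmers, path[-1]):
            if nxt not in used:
                walk(path + [nxt], used | {nxt})

    walk([start], {start})
    return results
```

When `reconstruct` returns more than one sequence, the subsequence set is ambiguous and additional information, such as restriction fragment data, would be needed to choose among the candidates.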
[0240] An exemplary flow chart for a sequencing program is provided in Figure 1. In general terms, the program
provides for automated scanning of the substrate to determine the positions of probe and target interaction. Simple
processing of the intensity of the signal may be incorporated to filter out clearly spurious signals. The positions with
positive interaction are correlated with the sequence specificity of specific matrix positions, to generate the set of
matching subsequences. This information is further correlated with other target sequence information, e.g., restriction
fragment analysis. The sequences are then aligned using overlap data, thereby leading to possible corresponding
target sequences which will, optimally, correspond to a single target sequence.

B. Hardware 

[0241] A variety of computer systems may be used to run a sequencing program. The program may be written to
provide both the detecting and scanning steps together and will typically be dedicated to a particular scanning
apparatus. However, the components and functional steps may be separated and the scanning system may provide an
output, e.g., through tape or an electronic connection into a separate computer which separately runs the sequencing
analysis program. The computer may be any of a number of machines provided by standard computer manufacturers,
e.g., IBM compatible machines, Apple™ machines, VAX machines, and others, which may often use a UNIX™
operating system. Alternatively, custom computing architectures may be employed; these architectures may include neural
network methods implemented in hardware and/or software. Of course, the hardware used to run the analysis program
will typically determine what programming language would be used.

C. Software 

[0242] Software would be readily developed by a person of ordinary skill in the programming art, following the flow 
chart provided, or based upon the input provided and the desired result. 

[0243] Of course, an exemplary embodiment is a polynucleotide sequence system. However, the theoretical and 
mathematical manipulations necessary for data analysis of other linear molecules are conceptually similar. 

XI. SUBSTRATE REUSE 

[0244] Where a substrate is made with specific reagents that are relatively insensitive to the handling and processing 
steps involved in a single cycle of use, the substrate may often be reused. The target molecules are usually stripped 
off of the solid phase specific recognition molecules. Of course, it is preferred that the manipulations and conditions 
be selected as to be mild and to not affect the substrate. For example, if a substrate is acid labile, a neutral pH would 
be preferred in all handling steps. Similar sensitivities would be carefully respected where recycling is desired.

A. Removal of Label

[0245] Typically, for recycling, the previously attached specific interaction would be disrupted and removed. This
will typically involve exposing the substrate to conditions under which the interaction between probe and target is
disrupted. Alternatively, it may be exposed to conditions where the target is destroyed. For example, where the probes
are oligonucleotides and the target is a polynucleotide, a heating and low salt wash will often be sufficient to disrupt
the interactions. Additional reagents may be added, such as detergents, and organic or inorganic solvents which disrupt
the interaction between the specific reagents and target.

B. Storage and Preservation

[0246] As indicated above, the matrix will typically be maintained under conditions where the matrix itself and the
linkages and specific reagents are preserved. Various specific preservatives may be added which prevent degradation.
For example, if the reagents are acid or base labile, a neutral pH buffer will typically be added. It is also desired to
avoid destruction of the matrix by growth of organisms which may destroy organic reagents attached thereto. For this
reason, a preservative such as cyanide or azide may be added. However, the chemical preservative should also be
selected to preserve the chemical nature of the linkages and other components of the substrate. Typically, a detergent
may also be included.

C. Processes to Avoid Degradation of Oligomers

[0247] In particular, a substrate comprising a large number of oligomers will be treated in a fashion which is known
to maintain the quality and integrity of oligonucleotides. These include storing the substrate in a carefully controlled
environment under conditions of lower temperature, cation depletion (EDTA and EGTA), sterile conditions, and inert
argon or nitrogen atmosphere.

XII. INTEGRATED SEQUENCING STRATEGY 

A. Initial Mapping Strategy 


[0248] As indicated above, although the VLSIPS technology may be applied to sequencing embodiments, it is often useful to
integrate other concepts to simplify the sequencing. For example, nucleic acids may be easily sequenced by careful
selection of the vectors and hosts used for amplifying and generating the specific target sequences. For example, it
may be desired to use specific vectors which have been designed to interact most efficiently with the VLSIPS substrate.
This is also important in fingerprinting and mapping strategies. For example, vectors may be carefully selected having
particular complementary sequences which are designed to attach to a generic or specific oligomer on the substrate.
This is also applicable to situations where it is desired to target particular sequences to specific locations on the matrix.
[0249] In one embodiment, unnatural oligomers may be used to target natural probes to specific locations on the
VLSIPS substrate. In addition, particular probes may be generated for the mapping embodiment which are designed
to have specific combinations of characteristics. For example, the construction of a mapping substrate may depend
upon use of another automated apparatus which takes clones isolated from a chromosome walk and attaches them
individually or in bulk to the VLSIPS substrate.

[0250] In another embodiment, a variety of specific vectors having known and particular "targeting" sequences
adjacent the cloning sites may be individually used to clone a selected probe, and the isolated probe will then be targetable
to a site on the VLSIPS substrate with a sequence complementary to the "target" sequence.

B. Selection of Smaller Clones 

[0251] In the fingerprinting and mapping embodiments, the selection of probes may be very important. Significant
mathematical analysis may be applied to determine which specific sequences should be used as those probes. Of
course, for fingerprinting use, sequences that show significant heterogeneity across the human population would be
preferred. Selection of the specific sequences which would most favorably be utilized will tend to be single copy
sequences within the genome, and more specifically single copy sequences that have low cross-hybridization potential
to other sequences in the genome (i.e., not members of a closely-related multigene family).
[0252] Various hybridization selection procedures may be applied to select sequences which tend not to be repeated
within a genome, and thus would tend to be conserved across individuals. For example, hybridization selections may
be made for non-repetitive and single copy sequences. See, e.g., Britten and Kohne (1968) "Repeated Sequences in
DNA," Science 161:529-540. On the other hand, it may be desired under certain circumstances to use repeated
sequences. For example, where a fingerprint may be used to identify or distinguish different species, or where repetitive
sequences may be diagnostic of specific species, repetitive sequences may be desired for inclusion in the fingerprinting
probes. In either case, the sequencing capability will greatly assist in the selection of appropriate sequences to be
used as probes.

[0253] Also as indicated above, various means for constructing an appropriate substrate may involve either mechan- 
ical or automated procedures. The standard VLSIPS automated procedure involves synthesizing oligonucleotides or 
short polymers directly on the substrate. In various other embodiments, it is possible to attach separately synthesized 
reagents onto the matrix in an ordered array. Other circumstances may lend themselves to transferring a pattern from a 
petri plate onto a solid substrate. Also, there are methods for site-specifically directing collections of reagents to specific 
locations using unnatural nucleotides or equivalent sorts of targeting molecules. 

[0254] While a brute-force manual transfer process may be used, sequentially attaching various samples to 
successive positions, instrumentation for automating such procedures may also be devised. An automated system for 
performing such transfers would preferably be simple to design and conceptually easy to understand. 

XIII. COMMERCIAL APPLICATIONS 

A. Sequencing 

[0255] As indicated above, sequencing may be performed either de novo or as a verification of another sequencing 
method. The present hybridization technology provides the ability to sequence nucleic acids and polynucleotides de 
novo, or as a means to verify either the Maxam and Gilbert chemical sequencing technique or Sanger and Coulson 
dideoxy sequencing techniques. The hybridization method is useful to verify sequences determined by any other 
sequencing technique and to closely compare two similar sequences, e.g., to identify and locate sequence differences. 
[0256] Of course, sequencing can be very important in many different sorts of environments. For example, it will 
be useful in determining the genetic sequence of particular markers in various individuals. In addition, polymers may 
be used as markers or as information-containing molecules to encode information. For example, a short polynucleotide 
sequence may be included in large bulk production samples indicating the manufacturer, date, and location of 
manufacture of a product. Various drugs, for instance, may be encoded with this information with a small number of molecules 
in a batch. A pill may have somewhere from 10 to 100 to 1,000 or more very short and small molecules 
encoding this information. When necessary, this information may be decoded from a sample of the material using a 
polymerase chain reaction (PCR) or other amplification method. This encoding system may be used to provide the 
origin of large bulky samples without significantly affecting the properties of those samples. For example, chemical 
samples may also be encoded by this method, thereby providing means for identifying the source and manufacturing 
details of lots. The origin of bulk hydrocarbon samples may be encoded. Production lots of organic compounds such 
as benzene or plastics may be encoded with a short molecule polymer. Foodstuffs may also be encoded using similar 
marking molecules. Even toxic waste samples can be encoded to determine the source or origin. In this way proper 
disposal can be traced or more easily enforced. 
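
The encoding idea in this paragraph can be illustrated with a minimal sketch that maps a short text label onto a polynucleotide sequence at two bits per base and reads it back. The base-to-bit assignment, the function names, and the sample label "LOT42" are all assumptions made for illustration; the specification does not prescribe any particular code.

```python
# Hypothetical 2-bit-per-base code: A=00, C=01, G=10, T=11.
# This assignment is an assumption for illustration, not taken from the text.
BASES = "ACGT"


def encode_label(label: str) -> str:
    """Encode an ASCII label as a DNA string, four bases per byte."""
    seq = []
    for byte in label.encode("ascii"):
        # Read the byte two bits at a time, most significant pair first.
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 0b11])
    return "".join(seq)


def decode_label(seq: str) -> str:
    """Invert encode_label: fold every four bases back into one byte."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return out.decode("ascii")


if __name__ == "__main__":
    tag = encode_label("LOT42")   # hypothetical batch label
    print(tag, "->", decode_label(tag))
```

A practical marker would presumably also need flanking primer sites for amplification and some error tolerance; neither is modeled here.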

[0257] Similar sorts of encoding may be provided by fingerprinting-type analysis. Whether the resolution is absolute 
or less so, the concept of coding information on molecules such as nucleic acids, which can be amplified and later 
decoded, may be a very useful and important application. 

[0258] This technology also provides the ability to include markers for origins of biological materials. For example, 
a patented animal line may be transformed with a particular unnatural sequence which can be traced back to its origin. 
With a selection of multiple markers, the likelihood could be negligible that a combination of markers would have 
independently arisen from a source other than the patented or specifically protected source. This technique may provide 
a means for tracing the actual origin of particular biological materials. Bacteria, plants, and animals will be subject to 
marking by such encoding sequences. 

B. Fingerprinting 

[0259] As indicated above, fingerprinting technology may also be used for data encryption. Moreover, fingerprinting 
allows for significant identification of particular individuals. Where the fingerprinting technology is standardized, and 
used for identification of large numbers of people, related equipment and peripheral processing will be developed to 
accompany the underlying technology. For example, specific equipment may be developed for automatically taking a 
biological sample and generating or amplifying the information molecules within the sample to be used in fingerprinting 
analysis. Moreover, the fingerprinting substrate may be mass produced using particular types of automatic equipment. 
Synthetic equipment may produce the entire matrix simultaneously by stepwise synthetic methods as provided by the 
VLSIPS technology. The attachment of specific probes onto a substrate may also be automated, e.g., making use of 
caged biotin technology. 

[0260] In addition, peripheral processing may be important and may be dedicated to this specific application. Thus, 
automated equipment for producing the substrates may be designed, or particular systems which take in a biological 
sample and output either a computer readout or an encoded instrument, e.g., a card or document which indicates the 
information and can provide that information to others. An identification having a short magnetic strip with a few million 
bits may be used to provide individual identification and important medical information useful in a medical emergency. 
[0261] In fact, data banks may be set up to correlate all of this information of fingerprinting with medical information. 
This may allow for the determination of correlations between various medical problems and specific DNA sequences. 
By collating large populations of medical records with genetic information, genetic propensities and genetic 
susceptibilities to particular medical conditions may be identified. Moreover, with standardization of substrates, the 
microencoding data may also be standardized to reproduce the information from a centralized data bank or on an encoding 
device carried on an individual person. On the other hand, if the fingerprinting procedure is sufficiently quick and routine, 
every hospital may routinely perform a fingerprinting operation and from that determine many important medical 
parameters for an individual. 

[0262] In particular industries, the VLSIPS sequencing, fingerprinting, or mapping technology will be particularly 
appropriate. As mentioned above, agricultural livestock suppliers may be able to encode and determine whether their 
particular strains are being used by others. By incorporating particular markers into their genetic stocks, the markers 
will indicate origin of genetic material. This is applicable to seed producers, livestock producers, and other suppliers 
of medical or agricultural biological materials. 

[0263] This may also be useful in identifying individual animals or plants. For example, these markers may be useful 
in determining whether certain fish return to their original breeding grounds, whether sea turtles always return to their 
original birthplaces, or to determine the migration patterns and viability of populations of particular endangered species. 
It would also provide means for tracking the sources of particular animal products. For example, it might be useful for 
determining the origins of controlled animal substances such as elephant ivory or particular bird populations whose 
importation or exportation is controlled. 
[0264] As indicated above, polymers may be used to encode important information on source and batch and supplier. 
This is described in greater detail, e.g., in "Applications of PCR to industrial problems," (1990) in Chemical and 
Engineering News 68:145. In fact, the synthetic method can be applied to the storage of enormous amounts of information. 
Small substrates may encode enormous amounts of information, and its recovery will make use of the inherent 
replication capacity. For example, with regions of 10 µm x 10 µm, 1 cm² has 10⁶ regions; in theory, the entire human genome 
could be attached in 1000 nucleotide segments on a 3 cm² surface. Genomes of endangered species may be stored 
on these substrates. 
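
The storage estimate above can be checked with a short calculation. The sketch below only restates the figures given in the text (10 µm x 10 µm regions, 1000-nucleotide segments, a 3 cm² surface); the constant and function names are illustrative.

```python
# Back-of-the-envelope check of the storage estimate in the text.
UM_PER_CM = 10_000           # 1 cm = 10,000 µm
REGION_SIDE_UM = 10          # each synthesis region is 10 µm x 10 µm
SEGMENT_LENGTH_NT = 1_000    # nucleotides attached per region
SURFACE_CM2 = 3              # substrate area considered in the text


def regions_per_cm2(side_um: int = REGION_SIDE_UM) -> int:
    """Number of square regions of the given side that tile 1 cm²."""
    per_side = UM_PER_CM // side_um
    return per_side * per_side


def capacity_nt(area_cm2: int = SURFACE_CM2) -> int:
    """Total nucleotides held when every region carries one segment."""
    return area_cm2 * regions_per_cm2() * SEGMENT_LENGTH_NT


if __name__ == "__main__":
    print(regions_per_cm2())   # regions per cm²
    print(capacity_nt())       # nucleotides on the whole surface
```

The result, 3 x 10⁹ nucleotides, is on the order of the haploid human genome, which is consistent with the claim in the paragraph.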

[0265] Fingerprinting may also be used for genetic tracing or for identifying individuals for forensic science purposes. 
See, e.g., Morris, J. et al. (1989) "Biostatistical Evaluation of Evidence From Continuous Allele Frequency Distribution 
DNA Probes in Reference to Disputed Paternity and Identity," J. Forensic Science 34:1311-1317, and references 
provided therein. 

[0266] In addition, high resolution fingerprinting allows particular samples to be finely distinguished. 
As indicated above, new cell classifications may be defined based on combinations of a large number of properties. 
Similar applications will be found in distinguishing different species of animals or plants. In fact, microbial identification 
may become dependent on characterization of the genetic content. Tumors or other cells exhibiting abnormal physiology 
will be detectable by use of the present invention. Also, knowing the genetic fingerprint of a microorganism may provide 
very useful information on how to treat an infection by such organism. 

[0267] Modifications of the fingerprint embodiments may be used to diagnose the condition of the organism. For 
example, a blood sample is presently used for diagnosing any of a number of different physiological conditions. A multi- 
dimensional fingerprinting method made available by the present invention could become a routine means for diag- 
nosing an enormous number of physiological features simultaneously. This may revolutionize the practice of medicine 
in providing information on an enormous number of parameters together at one time. In another way, the genetic 
predisposition may also revolutionize the practice of medicine providing a physician with the ability to predict the like- 
lihood of particular medical conditions arising at any particular moment. It also provides the ability to apply preventative 
medicine. 

[0268] Also available are kits with the reagents useful for performing sequencing, fingerprinting, and mapping 
procedures. The kits will have various compartments with the desired necessary reagents, e.g., substrate, labeling 
reagents for target samples, buffers, and other useful accompanying products. 

C. Mapping 

[0269] The present invention also provides the means for mapping sequences within enormous stretches of 
sequence. For example, nucleotide sequences may be mapped within enormous chromosome-size sequence maps. For 
example, it would be possible to map a chromosomal location within a chromosome which contains hundreds of 
millions of nucleotide base pairs. In addition, the mapping and fingerprinting embodiments allow for testing of 
chromosomal translocations, one of the standard problems for which amniocentesis is performed. 

[0270] The present invention will be better understood by reference to the following illustrative examples. The fol- 
lowing examples are offered by way of illustration and not by way of limitation. 
[0271] Relevant techniques are described in PCT publication no. WO90/15070, published December 13, 1990; and PCT 
publication no. WO91/07087, published May 30, 1991. 

[0272] Also, additional relevant techniques are described, e.g., in Sambrook, J., et al. (1989) Molecular Cloning: A 
Laboratory Manual, 2d Ed., vols 1-3, Cold Spring Harbor Press, New York; Greenstein and Winitz (1961) Chemistry of 
the Amino Acids, Wiley and Sons, New York; Bodanszky, M. (1988) Peptide Chemistry: A Practical Textbook, 
Springer-Verlag, New York; Harlow and Lane (1988) Antibodies: A Laboratory Manual, Cold Spring Harbor Press, New York; 
Glover, D. (ed.) (1987) DNA Cloning: A Practical Approach, vols 1-3, IRL Press, Oxford; Bishop and Rawlings (1987) 
Nucleic Acid and Protein Sequence Analysis: A Practical Approach, IRL Press, Oxford; Hames and Higgins (1985) 
Nucleic Acid Hybridisation: A Practical Approach, IRL Press, Oxford; Wu et al. (1989) Recombinant DNA Methodology, 
Academic Press, San Diego; Goding (1986) Monoclonal Antibodies: Principles and Practice, (2d ed.), Academic Press, 
San Diego; Finegold and Barron (1986) Bailey and Scott's Diagnostic Microbiology, (7th ed.), Mosby Co., St. Louis; 
Collins et al. (1989) Microbiological Methods, (6th ed.), Butterworth, London; Chaplin and Kennedy (1986) 
Carbohydrate Analysis: A Practical Approach, IRL Press, Oxford; Van Dyke (ed.) (1985) Bioluminescence and 
Chemiluminescence: Instruments and Applications, vol 1, CRC Press, Boca Raton; and Ausubel et al. (ed.) (1990) Current Protocols 
in Molecular Biology, Greene Publishing and Wiley-Interscience, New York. 

EXAMPLES 

[0273] The following examples are provided to illustrate the efficacy of the inventions herein. All operations were 
conducted at about ambient temperatures and pressures unless indicated to the contrary. 

POLYNUCLEOTIDE SEQUENCING 

1. HPLC of the photolysis of 5'-O-nitroveratryl-thymidine. 

[0274] In order to determine the time for photolysis of 5'-O-nitroveratryl thymidine to thymidine, a 100 µM solution of 
NVO-Thym-OH (5'-O-nitroveratryl thymidine) in dioxane was made and ~200 µl aliquots were irradiated (in a quartz cuvette 
1 cm x 2 mm) at 362.3 nm for 20 sec, 40 sec, 60 sec, 2 min, 5 min, 10 min, 15 min, and 20 min. The resulting irradiated 
mixtures were then analyzed by HPLC using a Varian MicroPak SP column (C18 analytical) at a flow rate of 1 ml/min 
and a solvent system of 40% CH3CN and 60% water. Thymidine has a retention time of 1.2 min and NVO-Thym-OH 
has a retention time of 2.1 min. It was seen that after 10 min of exposure the deprotection was complete. 

2. Preparation and Detection of Thymidine-Cytidine dimer (FITC) 
[0275] The reaction is illustrated: 

[Reaction scheme not reproduced.] 

[0276] To an aminopropylated glass slide (standard VLSIPS) was added a mixture of the following: 

12.2 mg of NVO-Thym-CO2H (IX) 
3.4 mg of HOBT (N-hydroxybenzotriazole) 
8.8 µl DIEA (diisopropylethylamine) 
11.1 mg BOP reagent 
2.5 ml DMF 

[0277] After 2 h coupling time (standard VLSIPS) the plate was washed, acetylated with acetic anhydride/pyridine, 
washed, dried, and photolyzed in dioxane at 362 nm at 14 mW/cm² for 10 min using a 500 µm checkerboard mask. 
The slide was then taken and treated with a mixture of the following: 

107 mg of FMOC-amine modified C (III) 
21 mg of tetrazole 
1 ml anhydrous CH3CN 

[0278] After being treated for approximately 8 min, the slide was washed off with CH3CN, dried, and oxidized with 
I2/H2O/THF/lutidine for 1 min. The slide was again washed, dried, and treated for 30 min with a 20% solution of DBU 
in DMF. After thorough rinsing of the slide, it was next exposed to a FITC solution (1 mM fluorescein isothiocyanate 
[FITC] in DMF) for 50 min, then washed, dried, and examined by fluorescence microscopy. This reaction is illustrated: 

[Reaction scheme not reproduced.] 

3. Preparation and Detection of Thymidine-Cytidine dimer (Biotin) 

[0279] An aminopropyl glass slide was soaked in a solution of ethylene oxide (20% in DMF) to generate a 
hydroxylated surface. To the slide was added a mixture of the following: 

32 mg of NVO-T-OCED (X) 

11 mg of tetrazole 

0.5 ml of anhydrous CH3CN 

[0280] After 8 min the plate was rinsed with acetonitrile, then oxidized with I2/H2O/THF/lutidine for 1 min, washed 
and dried. The slide was then exposed to a 1:3 mixture of acetic anhydride:pyridine for 1 h, then washed and dried. 
The substrate was then photolyzed in dioxane at 362 nm at 14 mW/cm² for 10 min using a 500 µm checkerboard 
mask, dried, and then treated with a mixture of the following: 

65 mg of biotin modified C (IV) 

11 mg of tetrazole 

0.5 ml anhydrous CH3CN 

[0281] After 8 min the slide was washed with CH3CN then oxidized with I2/H2O/THF/lutidine for 1 min, washed, and 
then dried. The slide was then soaked for 30 min in a PBS/0.05% Tween 20 buffer and the solution then shaken off. 
The slide was next treated with FITC-labeled streptavidin at 10 µg/ml in the same buffer system for 30 min. After this 
time the streptavidin-buffer system was rinsed off with fresh PBS/0.05% Tween 20 buffer and then the slide was finally 
agitated in distilled water for about 1/2 h. After drying, the slide was examined by fluorescence microscopy (see Fig. 
2 and Fig. 3). 





EP 0 834 575 B1 



4. substrate preparation 

[0282] Before attachment of reactive groups it is preferred to clean the substrate which is, in a preferred embodiment, 
a glass substrate such as a microscope slide or cover slip. A roughened surface will be useable but a plastic or other 

solid substrate is also appropriate. According to one embodiment the slide is soaked in an alkaline bath consisting of, 
e.g., 1 liter of 95% ethanol with 120 ml of water and 120 grams of sodium hydroxide for 12 hours. The slides are washed 
with a buffer and under running water, allowed to air dry, and rinsed with a solution of 95% ethanol. 
[0283] The slides are then aminated with, e.g., aminopropyltriethoxysilane for the purpose of attaching amino groups 
to the glass surface on linker molecules, although other omega functionalized silanes could also be used for this 
purpose. In one embodiment 0.1% aminopropyltriethoxysilane is utilized, although solutions with concentrations from 
10⁻⁷% to 10% may be used, with about 10⁻³% to 2% preferred. A 0.1% mixture is prepared by adding to 100 ml of a 
95% ethanol/5% water mixture, 100 microliters (µl) of aminopropyltriethoxysilane. The mixture is agitated at about 
ambient temperature on a rotary shaker for an appropriate amount of time, e.g., about 5 minutes. 500 µl of this mixture 
is then applied to the surface of one side of each cleaned slide. After 4 minutes or more, the slides are decanted of 
this solution and thoroughly rinsed three times or more by dipping in 100% ethanol. 

[0284] After the slides dry, they are heated in a 110-120°C vacuum oven for about 20 minutes, and then allowed to 
cure at room temperature for about 12 hours in an argon environment. The slides are then dipped into DMF (dimeth- 
ylformamide) solution, followed by a thorough washing with methylene chloride. 

20 5. linker attachment, blocking of free sites 

[0285] The aminated surface of the slide is then exposed to about 500 µl of, for example, a 30 millimolar (mM) solution 
of NVOC-nucleotide-NHS (N-hydroxysuccinimide) in DMF for attachment of a NVOC-nucleotide to each of the amino 
groups. See, e.g., SIGMA Chemical Company for various nucleotide derivatives. The surface is washed with, for 
example, DMF, methylene chloride, and ethanol. 

[0286] Any unreacted aminopropyl silane on the surface, i.e., those amino groups which have not had the 
NVOC-nucleotide attached, is now capped with acetyl groups (to prevent further reaction) by exposure to a 1:3 mixture of 
acetic anhydride in pyridine for 1 hour. Other materials which may perform this residual capping function include 
trifluoroacetic anhydride, formic acetic anhydride, or other reactive acylating agents. Finally, the slides are washed again 
with DMF, methylene chloride, and ethanol. 

6. synthesis of eight trimers of C and T 

[0287] Fig. 4 illustrates a possible synthesis of the eight trimers of the two-monomer set: cytosine and thymine 
(represented by C and T, respectively). A glass slide bearing silane groups terminating in 6-nitroveratryloxycarboxamide 
(NVOC-NH) residues is prepared as a substrate. Active esters (pentafluorophenyl, OBt, etc.) of cytosine and thymine 
protected at the 5' hydroxyl group with NVOC are prepared as reagents. While not pertinent to this example, if side 
chain protecting groups are required for the monomer set, these must not be photoreactive at the wavelength of light 
used to protect the primary chain. 
[0288] For a monomer set of size n, n x ℓ cycles are required to synthesize all possible sequences of length ℓ. A 
cycle consists of: 

1. Irradiation through an appropriate mask to expose the 5'-OH groups at the sites where the next residue is to be 
added, with appropriate washes to remove the by-products of the deprotection. 
2. Addition of a single activated and protected (with the same photochemically-removable group) monomer, which 
will react only at the sites addressed in step 1, with appropriate washes to remove the excess reagent from the 
surface. 

[0289] In one embodiment, the above cycle is repeated for each member of the monomer set until each location on the 
surface has been extended by one residue. In other embodiments, several residues are sequentially added at one 
location before moving on to the next location. Cycle times will generally be limited by the coupling reaction rate, now 
as short as about 10 min in automated oligonucleotide synthesizers. This step is optionally followed by addition of a 
protecting group to stabilize the array for later testing. For some types of polymers (e.g., peptides), a final deprotection 
of the entire surface (removal of photoprotective side chain groups) may be required. 
[0290] More particularly, as shown in Fig. 4A, the glass 20 is provided with regions 22, 24, 26, 28, 30, 32, 34, and 
36. Regions 30, 32, 34, and 36 are masked, indicated by the hatched regions, as shown in Fig. 4B and the glass is 
irradiated at the bright regions 22, 24, 26, and 28, and exposed to a reagent containing a photosensitive blocked C 
(e.g., cytosine derivative), with the resulting structure shown in Fig. 4C. The substrate is carefully washed and the 
[...] regions 22, 24, 26, and 28 are masked, as indicated by the hatched region; the glass 
is irradiated, as indicated by the bright regions, at 30, 32, 34, and 36, and exposed to a 
photosensitive blocked reagent containing T (e.g., thymine derivative), with the resulting structure shown in Fig. 4E. The 
process proceeds, consecutively masking and exposing the sections as shown until the structure shown in Fig. 4M is 
obtained. The glass is irradiated and the terminal groups are, optionally, capped by acetylation. As shown, all possible 
trimers of cytosine/thymine are obtained. 
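
The masking cycle described above can be simulated in a few lines. The sketch below is an assumption-laden illustration rather than the procedure of Fig. 4: it uses a simple "striped" masking order (halves, then quarters, then single regions for two monomers), which reproduces the first two masks described in the text but may differ from the later panels of the figure. The function name `synthesize` is hypothetical.

```python
def synthesize(monomers: str, length: int) -> tuple[list[str], int]:
    """Simulate masked, layer-by-layer synthesis on len(monomers)**length regions.

    Returns the sequence grown in each region and the number of
    irradiate-then-couple cycles that were needed.
    """
    n = len(monomers)
    n_regions = n ** length
    regions = [""] * n_regions
    cycles = 0
    for layer in range(length):
        # Width of each contiguous stripe of regions in this layer (assumed layout).
        block = n ** (length - layer - 1)
        for m_index, monomer in enumerate(monomers):
            # "Irradiate" only the regions whose stripe selects this monomer,
            # then "couple" the monomer there; all other regions stay masked.
            for r in range(n_regions):
                if (r // block) % n == m_index:
                    regions[r] += monomer
            cycles += 1
    return regions, cycles


if __name__ == "__main__":
    trimers, cycles = synthesize("CT", 3)
    print(cycles, trimers)
```

With the two-monomer set C and T and length 3, the eight regions end up holding eight distinct trimers after 2 x 3 = 6 cycles, in line with the n x ℓ cycle count stated in the text.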

[0291] [...] side protective group removal is necessary, as might be the case in modified 
nucleic acids, deprotection may be accomplished by treatment with ethanedithiol and trifluoro-[...] 

[0292] In general, the number of steps needed to obtain a particular polymer chain is defined by: 

$$n \times \ell \tag{1}$$

where: 

n = the number of monomers in the basis set of monomers, and 
ℓ = the number of monomer units in a polymer chain. 

[0293] Conversely, the synthesized number of sequences of length ℓ will be: 

$$n^{\ell} \tag{2}$$

[0294] Of course, greater diversity is obtained by using masking strategies which will also include the synthesis of 
polymers having a length of less than ℓ. If, in the extreme case, all polymers having a length less than or equal to ℓ 
are synthesized, the number of polymers synthesized will be: 

$$n^{\ell} + n^{\ell-1} + \cdots + n^{1} \tag{3}$$
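
As a worked illustration of these counting relations, the sketch below computes the number of lithographic steps (n times the chain length), the number of sequences of exactly a given length (n raised to that length), and the number of sequences of all lengths up to it. The function names are illustrative and not taken from the specification.

```python
def lithographic_steps(n: int, length: int) -> int:
    """Steps needed: n monomers per layer times `length` layers."""
    return n * length


def sequences_of_length(n: int, length: int) -> int:
    """Number of distinct sequences of exactly `length` units."""
    return n ** length


def sequences_up_to_length(n: int, length: int) -> int:
    """Number of sequences of every length from 1 through `length`."""
    return sum(n ** k for k in range(1, length + 1))


if __name__ == "__main__":
    # Two monomers (C, T), trimers: 6 steps, 8 sequences, 14 including shorter ones.
    print(lithographic_steps(2, 3), sequences_of_length(2, 3), sequences_up_to_length(2, 3))
    # Four nucleotides, 10-mers: 40 steps yield over a million sequences.
    print(lithographic_steps(4, 10), sequences_of_length(4, 10))
```

The contrast between the linear step count and the exponential sequence count is the point of the passage: 40 masking steps suffice, in principle, for all 1,048,576 10-mers.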



[0295] The maximum number of lithographic steps needed will generally be n for each "layer" of monomers, i.e., the 
total number of masks (and, therefore, the number of lithographic steps) needed will be n x ℓ. The size of the transparent 
[...] to be formed. In general, the size of the synthesis areas will be: 

$$\text{size of synthesis areas} = \frac{A}{S}$$

where: 

A is the total area available for synthesis; and 
S is the number of sequences desired in the area. 
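
The area relation can be made concrete with a small sketch. It assumes, for illustration only, that each synthesis area is a square, so that a side length can be derived from the area; the function names and the 1 cm² example surface are likewise assumptions.

```python
import math


def synthesis_area_cm2(total_area_cm2: float, n_sequences: int) -> float:
    """Area available to each sequence: total area A divided by sequence count S."""
    if n_sequences <= 0:
        raise ValueError("number of sequences must be positive")
    return total_area_cm2 / n_sequences


def square_side_um(area_cm2: float) -> float:
    """Side of a square region of the given area, in micrometres (square shape assumed)."""
    return math.sqrt(area_cm2) * 10_000


if __name__ == "__main__":
    # All 4**10 possible 10-mers on a hypothetical 1 cm² surface.
    area = synthesis_area_cm2(1.0, 4 ** 10)
    print(area, square_side_um(area))
```

For all 4¹⁰ 10-mers on 1 cm², each square region would be a little under 10 µm on a side.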

[0296] It will be appreciated by those of skill in the art that the above method could readily be used to simultaneously 
produce thousands or millions of oligomers on a substrate using the photolithographic techniques disclosed herein. 
Consequently, the method results in the ability to practically test large numbers of, for example, di, tri, tetra, penta, 
hexa, hepta, octa, nona, deca, even dodecanucleotides, or larger polynucleotides. 

[0297] The above example has illustrated the method by way of a manual example. It will of course be appreciated 
that automated or semi-automated methods could be used. The substrate would be mounted in a flow cell for automated 
addition and removal of reagents, to minimize the volume of reagents needed, and to more carefully control reaction 
conditions. Successive masks will be applicable manually or automatically. See, e.g., PCT publication no. WO90/15070. 

7. labeling of target 

[0298] The target oligonucleotide can be labeled using standard procedures referred to above. As discussed for 
certain situations, a reagent which recognizes interaction, e.g., ethidium bromide, may be provided in the detection 
step. Alternatively, fluorescence labeling techniques may be applied, see, e.g., Smith, et al. (1986) Nature, 321: 
674-679; and Prober, et al. (1987) Science, 238:336-341. The techniques described therein will be followed with minimal 
modifications as appropriate for the label selected. 

8. dimers of A, C, G, and T 

[0299] The described technique may be applied, with photosensitive blocked nucleotides corresponding to adenine, 
cytosine, guanine, and thymine, to make combinations of polynucleotides consisting of each of the four different 
nucleotides. All 16 possible dimers would be made using a minor modification of the described method. 

9. 10-mers of A, C, G, and T 

[0300] The described technique for making dimers of A, C, G, and T may be further extended to make longer 
oligonucleotides. The automated system described, e.g., in PCT publication no. WO90/15070 can be adapted to make all 
possible 10-mers composed of the 4 nucleotides A, C, G, and T. The photosensitive, blocked nucleotide analogues 
have been described above, and would be readily adaptable to longer oligonucleotides. 
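
To make the combinatorics of the dimer and 10-mer sets concrete, the following sketch enumerates every oligomer of a given length over the four nucleotides. It is only a way of listing the target sequences, not a model of the chemistry; the names `NUCLEOTIDES` and `all_oligomers` are illustrative.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_oligomers(length: int):
    """Lazily yield every sequence of the given length over A, C, G, and T."""
    for combo in product(NUCLEOTIDES, repeat=length):
        yield "".join(combo)


dimers = list(all_oligomers(2))

if __name__ == "__main__":
    print(len(dimers), dimers)
    # The full 10-mer set is large, so count it rather than list it.
    print(len(NUCLEOTIDES) ** 10)
```

The generator form matters for the longer case: the 16 dimers fit comfortably in a list, whereas the 4¹⁰ = 1,048,576 possible 10-mers are better produced on demand.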

10. specific recognition hybridization to 10-mers 

[0301] The described hybridization conditions are directly applicable to the sequence specific recognition reagents 
attached to the substrate, produced as described immediately above. The 10-mers have an inherent property of 
hybridizing to a complementary sequence. For optimum discrimination between full-matching and some mismatch, the 
conditions of hybridization should be carefully selected, as described above. Careful control of the conditions, and 
titration of parameters should be performed to determine the optimum collective conditions. 

11. hybridization 

[0302] Hybridization conditions are described in detail, e.g., in Hames and Higgins (1985) Nucleic Acid Hybridisation: 
A Practical Approach; and the considerations for selecting particular conditions are described, e.g., in Wetmur and 
Davidson, (1968) J. Mol. Biol. 31:349-370, and Wood et al. (1985) Proc. Natl. Acad. Sci. USA 82:1585-1588. As 
described above, conditions are desired which can distinguish matching along the entire length of the probe from where 
there is one or more mismatched bases. The length of incubation and conditions will be similar, in many respects, to 
the hybridization conditions used in Southern blot transfers. Typically, the GC bias may be minimized by the introduction 
of appropriate concentrations of the alkylammonium buffers, as described above. 

[0303] Titration of the temperature and other parameters is desired to determine the optimum conditions for specificity 
and distinguishability of absolutely matched hybridization from mismatched hybridization. 

[0304] A fluorescently labeled target or set of targets is generated, as described in Prober, et al. (1987) Science 
238:336-341, or Smith, et al. (1986) Nature 321:674-679. Preferably, the target or targets are of the same length as, 
or slightly longer than, the oligonucleotide probes attached to the substrate and they will have known sequences. Thus, 
only a few of the probes hybridize perfectly with the target, and which particular ones did would be known. 
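
Because the target sequence is known in this test, the probes expected to hybridize perfectly can be listed in advance. The sketch below does so under the assumption that a perfect match means the probe is the exact reverse complement of some window of the target; the function names and the example target are hypothetical.

```python
# Watson-Crick pairing table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with `seq`, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def perfectly_matching_probes(target: str, probe_length: int) -> set[str]:
    """Probes (5' to 3') that pair along their full length with some window of the target."""
    windows = (target[i:i + probe_length]
               for i in range(len(target) - probe_length + 1))
    return {reverse_complement(w) for w in windows}


if __name__ == "__main__":
    target = "ACGTTGCATGCA"   # hypothetical 12-base labeled target
    print(sorted(perfectly_matching_probes(target, 10)))
```

A 12-base target offers only three 10-base windows, so at most three of the roughly one million 10-mer probes should give a perfect-match signal, which is what makes such a target useful for calibrating discrimination.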

[0305] The substrate and probes are incubated under appropriate conditions for a sufficient period of time to allow 
hybridization to completion. The time is measured to determine when the probe-target hybridizations have reached 
completion. A salt buffer which minimizes GC bias is preferred, incorporating, e.g., a buffer such as tetramethylammonium 
or tetraethylammonium ion at between about 2.4 and 3.0 M. See Wood, et al. (1985) Proc. Nat'l Acad. Sci. USA 
82:1585-1588. The incubation time is typically at least about 30 min, and may be as long as about 1-5 days. Typically very long 
matches will hybridize more quickly, very short matches will hybridize less quickly, depending upon relative target and 
probe concentrations. The hybridization will be performed under conditions where the reagents are stable for that time 
duration. 

[0306] Upon maximal hybridization, the conditions for washing are titrated. Three parameters initially titrated are 
time, temperature, and cation concentration of the wash step. The matrix is scanned at various times to determine the 
conditions at which the distinguishability between true perfect hybrid and mismatched hybrid is optimized. These con- 
ditions will be preferred in the sequencing embodiments. 
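
One way to picture the wash titration is as a search over scanned conditions for the best separation between perfect-match and mismatch signal. The sketch below is a hedged illustration only: the `WashReading` record, the ratio used as the figure of merit, and every number in the example are assumptions, not values or methods taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WashReading:
    """One scan of the matrix under a given wash condition (hypothetical record)."""
    minutes: float          # wash time
    temp_c: float           # wash temperature
    cation_molar: float     # cation concentration of the wash
    perfect_signal: float   # fluorescence at a known perfect-match site
    mismatch_signal: float  # fluorescence at a known mismatched site


def discrimination(reading: WashReading, floor: float = 1e-9) -> float:
    """Ratio of perfect-match to mismatch signal; `floor` avoids division by zero."""
    return reading.perfect_signal / max(reading.mismatch_signal, floor)


def best_condition(readings):
    """Pick the scanned condition with the highest discrimination ratio."""
    return max(readings, key=discrimination)


if __name__ == "__main__":
    # Entirely made-up readings, for illustration only.
    scans = [
        WashReading(5, 25, 1.0, 100.0, 50.0),
        WashReading(15, 37, 0.5, 80.0, 8.0),
        WashReading(30, 45, 0.1, 20.0, 4.0),
    ]
    print(best_condition(scans))
```

A ratio alone can favor conditions where both signals are nearly washed away, so in practice a minimum acceptable perfect-match signal would likely be imposed as well.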

12. positional detection of specific interaction 

[0307] As indicated above, the detection of specific interactions may be performed by detecting the positions where 
the labeled target sequences are attached. Where the label is a fluorescent label, the apparatus described, e.g., in PCT 
publication no. WO90/15070 may be advantageously applied. In particular, the synthetic processes described above 
will have generated a pattern of specific sequences attached to the substrate, and a known pattern of interactions can 
be converted to corresponding sequences. 

[0308] In an alternative embodiment, a separate reagent which differentially interacts with the probe and interacted 
probe targets can indicate where interaction occurs or does not occur. A single-strand specific reagent will indicate 
where no interaction has taken place, while a double-strand specific reagent will indicate where interaction has taken 
place. An intercalating dye, e.g., ethidium bromide, may be used to indicate the positions of specific interaction. 

13. analysis 

[0309] Conversion of the positional data into sequence specificity will provide the set of subsequences whose 
analysis, by overlap segments, may be performed, as described above. Analysis is provided by the methodology described 
above or using, e.g., software available from the Genetic Engineering Center, P.O. Box 794, 11000 Belgrade, 
Yugoslavia (Yugoslav group). See, also, Macevicz, PCT publication no. WO 90/04652. 
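
The overlap analysis mentioned here can be sketched as chaining subsequences that share all but one base. The following is a minimal illustration under strong assumptions: every subsequence of the target was detected exactly once, all have the same length k, and every (k-1)-base overlap is unique. It is not the algorithm of the cited software, and the function name `reconstruct` is hypothetical.

```python
def reconstruct(subsequences):
    """Rebuild a sequence from its overlapping k-mers.

    Assumes each k-mer occurs once and every (k-1)-base overlap is unique;
    repeats in a real target violate this and raise ValueError here.
    """
    kmers = set(subsequences)
    if not kmers:
        raise ValueError("no subsequences given")
    k = len(next(iter(kmers)))
    if k < 2 or any(len(s) != k for s in kmers):
        raise ValueError("subsequences must share one length of at least 2")

    # Index each k-mer by its (k-1)-base prefix; a clash means a branch point.
    by_prefix = {}
    for s in kmers:
        if s[:-1] in by_prefix:
            raise ValueError("ambiguous overlap: " + s[:-1])
        by_prefix[s[:-1]] = s

    # The first k-mer is the only one whose prefix is not another k-mer's suffix.
    suffixes = {s[1:] for s in kmers}
    starts = [s for s in kmers if s[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting subsequence")

    sequence = starts[0]
    used = 1
    while used < len(kmers):
        nxt = by_prefix.get(sequence[-(k - 1):])
        if nxt is None:
            raise ValueError("subsequences do not form one chain")
        sequence += nxt[-1]   # each overlap extends the sequence by one base
        used += 1
    return sequence


if __name__ == "__main__":
    target = "ACGTTGCA"       # hypothetical target
    detected = {target[i:i + 4] for i in range(len(target) - 3)}
    print(reconstruct(detected))
```

On an array, the detected subsequences would be the complements of the probes that hybridized; that conversion step is omitted here for brevity.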

POLYNUCLEOTIDE FINGERPRINTING 

[0310] The above section on generation of reagents for sequencing provides specific reagents useful for fingerprinting
applications. Fingerprinting embodiments may be applied towards polynucleotide fingerprinting, cell and tissue
classification, cell and tissue temporal development stage classification, diagnostic tests, forensic uses for individual
identification, classification of organisms, and genetic screening of individuals. Mapping applications are also described.

[0311] Polynucleotide fingerprinting may use reagents similar to those described above for probing a sequence for
the presence of specific subsequences found therein. Typically the subsequences used for fingerprinting will be longer
than the sequences [...]
the similarity of different samples of nucleic acids. They may also be used to fingerprint whether specific combinations
of information are provided therein. Particular probe sequences are selected and attached in a positional manner to a
substrate. The means of attachment may be a method using targeting molecules. In one embodiment, an unnatural [...]
be directed towards complementary sequences on a VLSIPS substrate. Typically, unnatural nucleotides would be
preferred, e.g., unnatural optical isomers, which would not interfere with natural nucleotide interactions.

[0312] Having produced [...],
the substrate may be used in a manner quite similar to the sequencing embodiment to provide information as to whether
the fingerprint probes are detecting the corresponding sequence in a target sequence. This will often provide information
similar to a Southern blot hybridization.

Temporal Development 
Developmental RNA expression patterns 

[0313] The present fingerprinting invention also allows cell classification by identification of developmental RNA
expression patterns. For example, a lymphocyte stem cell expresses a particular combination of RNA species. As the
lymphocyte develops through a programmed developmental scheme, at various stages it expresses particular RNA species
of specific structural features which are diagnostic of developmental or functional features which will allow classification
of cells into temporal developmental classes. Cells, products of those cells, or lysates of those cells will be assayed
to determine the developmental stage of the source cells. In this manner, once a developmental stage is defined,
specific synchronized populations of cells will be selected out of another population. These synchronized populations
may be very important in determining the biological mechanisms of development.

[0314] The present invention also allows for fingerprinting of the mRNA population of a cell. In this fashion, the mRNA
population, which should be a good determinant of developmental stage, will be correlated with other structural features
[...]. Cells at specific developmental stages will be characterized by the intracellular environment,
as well as the extracellular environment.

Diagnostic Tests 

[0315] The present invention also provides the ability to perform diagnostic tests. Diagnostic tests typically are based
upon a fingerprint type assay, which tests for the presence of specific diagnostic polynucleotides. Thus the present
invention provides means for viral strain identification, bacterial strain identification, and other diagnostic tests using
positionally defined specific oligonucleotide reagents.
Viral Identification 

[0316] The present invention provides reagents and methodology for identifying viral strains. The viral genome may
be probed for specific sequences which are characteristic of particular viral strains. Specific hybridization patterns on
a VLSIPS oligonucleotide substrate can identify the presence of particular viral genomes.
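
Matching a hybridization pattern to known strains can be sketched as a comparison against stored reference patterns. The example assumes each pattern is reduced to the set of probe positions scored positive and uses Jaccard similarity with a cutoff; the similarity measure, the cutoff, and the strain and probe names are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two sets of positive probe positions (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def identify_strain(sample, references, min_score=0.8):
    """Return (best matching strain or None, score) for a sample pattern."""
    name, score = max(
        ((n, jaccard(sample, pattern)) for n, pattern in references.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= min_score else (None, score)


# Hypothetical reference patterns, for illustration only.
references = {
    "strain_A": {"p1", "p2", "p3", "p4"},
    "strain_B": {"p1", "p5", "p6"},
}
match = identify_strain({"p1", "p2", "p3", "p4"}, references)
no_match = identify_strain({"p7"}, references)
```

A sample that lights up exactly the strain_A probes is reported as strain_A, while a pattern sharing no probes with any reference is reported as unidentified.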

Bacterial Identification 

[0317] Similar techniques will be applicable to identifying a bacterial source. This may be useful in diagnosing bac- 
terial infections, or in classifying sources of particular bacterial species. For example, the bacterial assay may be useful 
in determining the natural range of survivability of particular strains of bacteria across regions of the country or in 
different ecological niches. 

Other Microbiological Identifications 

[0318] The present invention not only provides means for diagnosis of other microbiological and other species, e.g.,
protozoal species and parasitic species in a biological sample, but also provides the means for assaying a combination
of different infections. For example, a biological specimen may be assayed for the presence of any or all of these
microbiological species. In human diagnostic uses, typical samples will be blood, sputum, stool, urine, or other samples.

Individual Identification 

[0319] The present invention provides the ability to fingerprint and identify a genetic individual. This individual may
be a bacterium or lower microorganism, as described above in diagnostic tests, or a plant or animal. An individual
may be identified genetically, as described.

[0320] Genetic fingerprinting has been utilized in comparing different related species in Southern hybridization blots.
Genetic fingerprinting has also been used in forensic studies, see, e.g., Morris et al. (1989) J. Forensic Science 34:
1311-1317, and references cited therein. As described above, an individual may be identified genetically by a sufficiently
large number of probes. The likelihood that another individual would have an identical pattern over a sufficiently large
number of probes may be statistically negligible. However, it is often quite important that a large number of probes be
used where the statistical probability of matching is desired to be particularly low. In fact, the probes will optimally be
selected for having high heterogeneity among the population. In addition, the fingerprint method may make use of the
pattern of homologies indicated by a series of more and more stringent washes. Then, each position has both a sequence
specificity and a homology measurement, the combination of which greatly increases the number of dimensions
and reduces the statistical likelihood of a perfect pattern match with another genetic individual.
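
The effect of probe number on the chance of a coincidental match can be made concrete with a short calculation. It assumes, purely for illustration, that probes behave independently and that each probe has a known probability of giving the same result in an unrelated individual; neither assumption nor the numbers come from the description.

```python
import math


def random_match_probability(per_probe_match):
    """Probability that an unrelated individual matches on every probe,
    assuming the probes are statistically independent."""
    return math.prod(per_probe_match)


def probes_needed(per_probe_match, target):
    """Smallest number of equally informative probes giving a match
    probability at or below the target."""
    return math.ceil(math.log(target) / math.log(per_probe_match))


# Hypothetical: 20 probes, each matching by chance half of the time.
p_match = random_match_probability([0.5] * 20)
n_probes = probes_needed(0.5, 1e-6)
```

With these assumed figures, 20 probes already push the coincidental match probability below one in a million, and probes with higher heterogeneity in the population (a lower per-probe match probability) reach the same level with fewer probes.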

Genetic Screening 

1. test alleles with markers

[0321] The present invention provides for the ability to screen for genetic variations of individuals. For example, a
number of genetic diseases are linked with specific alleles. See, e.g., Scriver, C. et al. (eds.) (1989) The Metabolic
Basis of Inherited Disease, McGraw-Hill, New York. For instance, cystic fibrosis has been correlated with a
specific gene, see Gregory et al. (1990) Nature 347: 382-386. A number of alleles are correlated with specific genetic
deficiencies. See, e.g., McKusick, V. (1990) Mendelian Inheritance in Man: Catalogs of Autosomal Dominant, Autosomal
Recessive, and X-linked Phenotypes, Johns Hopkins University Press, Baltimore; Ott, J. (1985) Analysis of Human
Genetic Linkage, Johns Hopkins University Press, Baltimore; Track, R. et al. (1989) Banbury Report 32: DNA
Technology and Forensic Science, Cold Spring Harbor Press, New York.

2. Amniocentesis 

[0322] Typically, amniocentesis is used to determine whether chromosome translocations have occurred. The mapping
procedure may provide the means for determining whether these translocations have occurred, and for detecting
particular alleles of various markers.
MAPPING

Positionally Located Clones 

[0323] The present invention allows for the positional location of specific clones useful for mapping. For example,
caged biotin may be used for specifically positioning a probe to a location on a matrix pattern.

[0324] In addition, the specific probes may be positionally directed to specific locations on a substrate by targeting.
For example, polypeptide specific recognition reagents may be attached to oligonucleotide sequences which can be
complementarily targeted, by hybridization, to specific locations on a VLSIPS substrate. Hybridization conditions, as
applied for oligonucleotide probes, will be used to target the reagents to locations on a substrate having complementary
oligonucleotides synthesized thereon. In another embodiment, oligonucleotide probes may be attached to specific
polypeptide targeting reagents such as an antigen or antibody. These reagents can be directed towards a complementary
antigen or antibody already attached to a VLSIPS substrate.

[0325] In another embodiment, an unnatural nucleotide which does not interfere with natural nucleotide complemen- 
tary hybridization may be used to target oligonucleotides to particular positions on a substrate. Unnatural optical isomers 
of natural nucleotides should be ideal candidates. 

[0326] In this way, short probes may be used to determine the mapping of long targets, or long targets may be used
to map the position of shorter probes. See, e.g., Craig et al. (1990) Nuc. Acids Res. 18: 2653-2660.

Positionally Defined Clones 

[0327] Positionally defined clones may be transferred to a new substrate by either physical transfer or by synthetic 
means. Synthetic means may involve either a production of the probe on the substrate using the VLSIPS synthetic 
methods, or may involve the attachment of a targeting sequence made by VLSIPS synthetic methods which will target 
that positionally defined clone to a position on a new substrate. Both methods will provide a substrate having a number 
of positionally defined probes useful in mapping. 

CONCLUSION 

[0328] The present inventions provide greatly improved methods and apparatus for synthesis of polymers on substrates.
It is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments
will be apparent to those of skill in the art upon reviewing the above description. By way of example, the invention
has been described primarily with reference to the use of photoremovable protective groups, but it will be readily
recognized by those of skill in the art that sources of radiation other than light could also be used. For example, in
some embodiments it may be desirable to use protective groups which are sensitive to electron beam irradiation or
x-ray irradiation, in combination with electron beam lithography or x-ray lithography techniques. Alternatively, the group
could be removed by exposure to an electric current. The scope of the invention should, therefore, be determined not
with reference to the above description, but should instead be determined with reference to the appended claims along
with the full scope of equivalents to which such claims are entitled.

Claims 

1. A method for identifying or distinguishing a target nucleic acid in a sample comprising:

(a) providing an array of at least 100 different probes bound to a substrate in known locations and at a density
of at least 1000 probes per square centimetre;

(b) applying the sample to the substrate to obtain a hybridization pattern of the sample; and 

(c) comparing the hybridization pattern with a reference pattern to identify or distinguish the target nucleic acid. 

2. A method as claimed in claim 1, wherein the reference pattern is obtained by applying a second nucleic acid to
the or another said substrate.

3. A method as claimed in claim 1, wherein the reference pattern comprises a reference database of a plurality of
sources of nucleic acid and the comparison of sample and reference patterns permits an identification of the source
of the sample.

4. A method as claimed in any of claims 1 to 3, wherein at least part of the sequence of each of the probes is known. 




EP 0 834 575 B1 



5. A method as claimed in any of claims 1 to 3, wherein at least some of the probes are oligonucleotides. 

6. A method as claimed in any of claims 1 to 4, wherein the sample hybridization pattern is analysed to generate a 
partial nucleotide sequence for the sample nucleic acid, and the partial nucleotide sequence is compared with a 
nucleotide sequence of the reference. 

7. A method as claimed in any of claims 1 to 6, wherein the second nucleic acid is from an individual and the result 
of comparing the sample and reference hybridization patterns determines whether the test sample is from the 
individual. 

8. A method as claimed in any of claims 3 to 5, wherein the sample is from an individual and the plurality of sources 
of nucleic acid are provided by potential relatives of the individual, thereby permitting identification of a genealogy 
of the individual. 

9. A method as claimed in any of claims 3 to 5, wherein the sample is from an organism and the sources of nucleic 
acid are provided by a plurality of known individuals, thereby permitting identification of the organism as one of 
the known individuals. 

10. A method as claimed in claim 1, wherein the sample is from an abnormal, e.g. tumour, neoplastic, diseased or
infected, tissue and contains transcripts of the abnormal tissue, and the reference pattern is a reference database 
having expression patterns for a plurality of known abnormalities of tissue, the comparison of sample and reference 
patterns permitting an identification of abnormal tissue. 

11. A method as claimed in claim 1, wherein the sample is from a tissue and contains RNA of the tissue, and the
reference pattern is a reference database having expression patterns for a plurality of known cells, the comparison 
of sample and reference patterns permitting an identification of the cellular composition, degree of cellular differ- 
entiation, stage of cellular development or metastatic potential of the tissue. 

12. A method as claimed in claim 1, wherein the sample is from a microbe and contains nucleic acid of the microbe, 
and the reference pattern is a reference database, the comparison of sample and reference patterns permitting 
an identification of the microbe. 

13. A method as claimed in claim 12, wherein the microbe is selected from the group consisting of protozoa, virus and 
bacteria. 

14. A method as claimed in any preceding claim, wherein the array has at least 10^3, preferably at least 10^4, more
preferably at least 10^5, even more preferably at least 10^6 different probes bound to the substrate.

15. A method as claimed in any preceding claim, wherein the probes are bound to the substrate at a density of at least
10^4, preferably at least 10^5, more preferably at least 10^6 known locations per square centimetre.

16. A method as claimed in any preceding claim, wherein the probes are more than 15, preferably more than 25, more
preferably more than 50 nucleotides in length.



Claims (English translation of the German text)

1. A method for identifying and distinguishing a target nucleic acid in a sample, comprising:

(a) providing an arrangement (array) of at least 100 different probes, bound to a substrate in known position
and at a density of at least 1000 probes per square centimetre;

(b) adding the sample to the substrate in order to obtain a hybridization pattern of the sample; and

(c) comparing the hybridization pattern with a reference pattern in order to identify or distinguish the target
nucleic acid.

2. A method according to claim 1, wherein the reference pattern is obtained by adding a second nucleic acid
to the or another said substrate.

4. A method according to any of claims 1 to 3, wherein at least part of the sequence of each probe is known.

5. A method according to any of claims 1 to 3, wherein at least some of the probes are oligonucleotides.




Claims (English translation of the French text)

1. A method for identifying or distinguishing a target nucleic acid in a sample, consisting of:

a) providing an array of at least 100 different probes bound to a substrate in known locations
and at a density of at least 1000 probes per square centimetre;

b) applying the sample to the substrate to obtain a hybridization pattern of the sample; and

c) comparing the hybridization pattern with a reference pattern to identify or distinguish the target nucleic
acid.

2. A method according to claim 1, in which the reference pattern is obtained by applying a second nucleic acid
to said substrate or to another.

3. A method according to claim 1, in which the reference pattern comprises a reference database
of several sources of nucleic acid and the comparison of the sample with reference patterns permits
an identification of the source of the sample.

4. A method according to any one of claims 1 to 3, in which at least part of the sequence of
each of the probes is known.

5. A method according to any one of claims 1 to 3, in which at least some probes are
oligonucleotides.

6. A method according to any one of claims 1 to 4, in which the hybridization pattern of the sample is
analysed to generate a partial nucleotide sequence for the nucleic acid sample, the partial nucleotide
sequence being compared with a reference nucleotide sequence.

7. A method according to any one of claims 1 to 6, in which the second nucleic acid comes
from an individual and the result of the comparison of the sample with reference hybridization patterns determines
whether the test sample comes from an individual.

8. A method according to any one of claims 3 to 5, in which the sample comes from an individual and the
plurality of sources of nucleic acid is provided by potential relatives of the individual, which permits
identification of the genealogy of the individual.

9. A method according to any one of claims 3 to 5, in which the sample comes from an organism and
the sources of nucleic acid are provided by several known individuals, which permits identifying the organism
as belonging to one of the known individuals.

10. A method according to claim 1, in which the sample comes from an abnormal tissue, for example a tumour,
a neoplastic, pathological or infected tissue, and contains transcripts of the abnormal tissue, the reference pattern
being a reference database having expression patterns for a plurality of known tissue
abnormalities, the comparison of the sample and the reference patterns permitting an identification of the
abnormal tissue.

11. A method according to claim 1, in which the sample comes from a tissue and contains an RNA of the tissue, the
reference pattern being a reference database having expression patterns for a plurality
of known cells, the comparison of the sample and the reference patterns permitting an identification of
the cellular composition, the degree of cellular differentiation, the stage of cellular development or the
metastatic potential of the tissues.

12. A method according to claim 1, in which the sample comes from a microbe and contains the nucleic acid
of the microbe, the reference pattern being a reference database, the comparison of the sample and the
reference patterns permitting an identification of the microbe.

13. A method according to claim 12, in which the microbe is selected from the group comprising
protozoa, viruses and bacteria.

14. A method according to any one of the preceding claims, in which the array consists of at
least 10^3, preferably at least 10^4, more preferably at least 10^5, and better still at
least 10^6 different probes bound to the substrate.